1 Introduction

R statistical software and associated tools have the potential to increase your efficiency, simplify collaboration, and help you develop reproducible products. Often the biggest barrier to using these tools is simply a lack of awareness. This document provides an overview of many R-related tools and is intended for individuals with at least a basic understanding of R. We will learn to create R functions, build a robust project structure using RStudio, produce aesthetically pleasing documentation with R Markdown, and develop interactive tools with Shiny. In addition, we will delve into the ecosystem of packages developed by RStudio, known as the Tidyverse (e.g., dplyr, tidyr, purrr, and ggplot2), which provides useful and intuitive functions for data manipulation, processing, and visualization. Overall, this document will give you working knowledge of some of the most widely used R-related tools available for your next project.

1.1 Goals

  1. To make you aware of the tools available to you.
    • ggplot2 for creating figures
    • dplyr and tidyr for data manipulation
    • lubridate for handling and formatting dates
    • rmarkdown for creating documents
    • shiny for creating interactive apps
  2. To provide you with the vocabulary and understanding of R syntax to…
    • Understand how to ask meaningful questions
    • Understand how to use functions
    • Interpret example code
  3. To provide you with an overview of project structure
    • Understand where to start
    • Understand tools that are available to you for project development
      • RStudio
      • R-Projects
      • Git + GitHub
      • rmarkdown

R, like any other skill, is going to require you to invest time in practice. I cannot help you with time management, but I can provide you with 1) an awareness of the tools available and 2) the vocabulary and understanding of syntax to make practicing less frustrating.

1.1.1 Tip

A good way to invest more time in R is to use R in situations where you would normally use MS Excel.
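For example, a calculation you might otherwise do with spreadsheet formulas takes one line in R (the rainfall values below are made up for illustration):

```r
# Monthly rainfall totals in inches (made-up numbers)
rainfall <- c(3.2, 2.8, 4.1, 3.9)

# Equivalents of Excel's AVERAGE() and SUM()
mean(rainfall)
## [1] 3.5
sum(rainfall)
## [1] 14
```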

1.2 What is R?

R is an open source programming language developed for statistical computing and graphics production. “R can be considered as a different implementation of S”, a language that was developed at Bell Laboratories (https://www.r-project.org/about.html).

1.2.1 Benefits of Using R

  1. Reproducibility: Standardized processes (e.g., functions, loops, documentation)
    • When using MS Excel, processes are often spread across multiple sheets or calculations are performed haphazardly within a single sheet. In general, this makes it very hard to interpret the processes performed and to reproduce them.
  2. Power: Ability to perform simple and complex data manipulations, iterative processes, and calculations
    • Access to more than 10,000 packages on CRAN
    • New packages are constantly being developed
    • New features are constantly being added to existing packages

1.2.2 R Packages

R packages are extensions of base R that provide additional features or alternative functionality.

  • Availability
    • CRAN (https://cran.r-project.org/)
      • The Comprehensive R Archive Network (CRAN)
      • FTP and web servers that store R Packages
      • Packages are required to meet certain standards
    • GitHub (https://github.com)
      • These packages are usually under development
      • Contains development versions of many packages available on CRAN
    • Custom (http://r-pkgs.had.co.nz/)
      • You have the ability to create your own packages.
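CRAN packages are installed with install.packages() and loaded with library(). A minimal sketch of the install-if-missing pattern, here using the base-recommended stats package (substitute any CRAN package name):

```r
pkg <- "stats"  # substitute any CRAN package name, e.g. "dplyr"

# Install the package only if it is not already available, then load it
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}
library(pkg, character.only = TRUE)
```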

2 Quick Reference

  1. Cheat Sheets: https://www.rstudio.com/resources/cheatsheets/
  2. R Bloggers: https://www.r-bloggers.com/
  3. Questions
  4. Style Guides
  5. Books and Papers
  6. Packages

2.2 Updating Software and Packages

2.2.1 R

Run the following code in the RGui, NOT in RStudio. The RGui should be installed when you install R. On my Windows machine, I access R by clicking on the R program file, “R x64 3.5.1”.

You should get a window like this if you have opened the correct program.

This code was copied from https://www.r-statistics.com/2013/03/updating-r-from-r-on-windows-using-the-installr-package/. Make sure RStudio is closed before running this code within the RGui.

# installing/loading the package:
if (!require(installr)) {
  install.packages("installr")
  require(installr)
} # load / install+load installr

# using the package:
updateR()

2.2.2 RStudio

  1. Open RStudio
  2. Click on “Help” on the toolbar
  3. Click on “Check for Updates”
  4. Follow instructions

2.2.3 R-Packages

  1. Open RStudio
  2. Click on “Tools” on the toolbar
  3. Click on “Check for Package Updates…”
  4. Follow instructions

2.2.3.1 Packages for Workshop

Please run the following code within RStudio to make sure you have all of the necessary packages for this workshop installed.

  1. Open RStudio
  2. Copy the following code
package.vec <- c("tidyverse", "lubridate",
                 "knitr", "rmarkdown", "markdown", "caTools", "bitops",
                 "DT", "leaflet", "shiny", "jsonlite",
                 "data.table", "rprojroot", "viridis")

install.packages(package.vec)
  3. Paste the code into the Console within RStudio
  4. Hit Enter
    • If prompted with “Do you want to restart R prior to installing?”, select “Yes”
    • If prompted again, select “No”
  5. The packages should begin to install. This may take some time.

3 R Project

Overview

  • Easier to access files (relative path)
    • Relative paths help prevent broken paths
      • In general, DO NOT use setwd()
      • Relative paths will continue to work if the project folder is moved to a new location on your local machine or moved to a new machine.
  • Designed to easily integrate with version control (Git)
  • In general, all data, scripts, and output should be stored within the project directory.
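As a sketch of why relative paths matter (the file names below are hypothetical), compare an absolute path with a project-relative path built with file.path():

```r
# Absolute path (avoid): breaks when the project folder moves
# "C:/Users/me/projects/new_project/data/example.csv"

# Relative path (preferred): resolved from the project root,
# so it works wherever the project folder lives
relative.path <- file.path("data", "example.csv")
relative.path
## [1] "data/example.csv"
```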

3.1 Create a New R Project

  1. Create a new R project by clicking on the dropdown menu at the top right of RStudio and selecting “New Project”
  2. Select “New Directory” within the “Create Project” window
  3. Select “New Project” within the “Project Type” window
  4. Enter a project name (below I have given the name “new_project”), select the project directory (where the project should live), and click “Create Project”
    • Tip: Create a “project” folder that will act as your parent directory for all R projects. This will make it much easier to navigate to and between projects.
  5. A new session specific to your R project will start within RStudio
    • There are a number of ways to tell which project is open…

4 Version Control

Version control software keeps track of changes made to files. This provides the user with the ability to revert changes back to an earlier version, a backup of the files, and simplifies collaboration efforts.

Git is a free, open source version control system (https://git-scm.com/), and GitHub is a platform that allows the user to store the changes made via Git in the form of repositories (https://github.com/). RStudio has integrated Git into the RStudio IDE, making it easy to work with repositories from GitHub (https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN). Git will need to be installed locally (https://git-scm.com/downloads) and a GitHub account will need to be created (https://github.com/join?source=header-home) before these tools can be accessed from RStudio.

4.1 Git Resources

4.3 Push and Pull Repository Changes in R Studio

Whenever updates are made to the files within the R project folder, they will be queued in the “Git” tab that appears in the “Environment” pane in RStudio.

4.3.1 Pull

Pull every time you open the project to make sure you have the most up-to-date version of the repository. Changes may have been made by you from a different computer or by one of your collaborators.

4.3.2 Commit

A commit is an informative message to yourself or collaborators indicating why you made a change. When the “Commit” button is selected, an “RStudio: Review Changes” window will appear. In this window, all of the altered files will appear in the upper left pane. By selecting an individual file in the upper left pane, the user can see the changes that were implemented in the bottom pane of the “RStudio: Review Changes” window. Deletions will appear in red, while insertions will appear in green. One or more files can be staged, and then the user has three options: 1) Revert, 2) Ignore, or 3) Commit.

  1. The “Revert” button will revert the staged file(s) to the previous state that is available in the GitHub repository.
  2. The “Ignore” button will add the staged file(s) to the .gitignore file. The .gitignore file informs Git that a file should not be added to the GitHub repository and that subsequent changes to the file should not be tracked. GitHub will prevent users from pushing large data sets, and thus large data sets should be added to the .gitignore file. Also, files containing sensitive information (e.g., usernames or passwords) should be added to the .gitignore file.
  3. Staged file(s) require a commit message, an informative message indicating why a change was made, prior to being committed. All commits remain local until the “Push” button is used to send the changes to the GitHub repository.
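For reference, a .gitignore file is a plain-text file in the repository root with one pattern per line; the entries below are hypothetical examples of the file types discussed above:

```
# large data sets (GitHub restricts file sizes)
data/large_dataset.csv

# files containing sensitive information
credentials.R
```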

4.3.3 Push

Push commits from RStudio to the GitHub repository.

4.4 Repository Branch

When a repository is created, it consists of a single branch labeled “master.” The master branch will suffice as you first develop your project. However, you may reach a point where the master branch is functioning well (without any known issues) but you want to make some dramatic development changes. Rather than committing the changes to the master branch (potentially breaking your working product), you can create a new isolated branch to work on your development changes. In this case, the branch would clone the current state of the master, and then any edits made to the new branch would not impact the master branch.

When working in R Studio with GitHub, use the drop-down menu (located in the top right corner of the Git tab in the Environment pane) to select the branch you want to work on. In the image below I clicked on “master” and now I can see three branches are available for this project: 1) master, 2) Development, and 3) Zsmith. Simply select a name to change the branch you are working on.

4.4.1 Create a New Branch

Creating a new branch is relatively simple. I know of three ways to create a repository branch: 1) via RStudio, 2) via GitHub, and 3) via the Git Shell.

4.4.1.1 R Studio

In the “Git” tab of the Environment pane in RStudio, there is a button for creating a new branch. Click on this button:

4.4.1.2 GitHub

Log on to your online GitHub account (https://github.com/) and navigate to the repository you are working on. Under the “Code” tab, click on the “Branch:” button. This will produce a drop-down menu (as seen in the image below), where you can select an existing branch or create a new branch by typing the name you want to assign to your new branch into the input box labeled “Find or create a branch.” Once you type in the new name, a new box will appear in the drop-down menu that says “Create Branch”. Click on this box to create the new branch.

4.4.1.3 Git Shell

The Git Shell can be accessed via R Studio in the Git tab of the Environment pane. Click on the “More” dropdown menu and then click “Shell…” (see image below).

A new window will appear. Use this link to understand how to create and work with branches in the Git Shell: https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging.

4.5 Merge Branches

After you have vetted a new branch, you can merge the new branch with the master branch (or some other branch). The merge will join all the changes made in the new branch and all of the changes made in the master branch. You may run into conflict issues if both branches updated the same section of code (https://help.github.com/articles/resolving-a-merge-conflict-using-the-command-line/).

Merging branches is not as simple as creating branches. As far as I know, branches can only be merged using the Git Shell (see the Git Shell section to learn how to access it). Use the following link to understand how to merge branches: https://help.github.com/articles/merging-an-upstream-repository-into-your-fork/.

5 R Markdown

  • Free Book: R Markdown: The Definitive Guide
  • RStudio Lessons: https://rmarkdown.rstudio.com/lesson-1.html

Markdown is a markup language for developing and formatting documents. R Markdown is an R package that allows the user to integrate text, R code, and R code output into a well-formatted document (e.g., HTML, MS Word, PDF).

My recommendation is to create an R Markdown file for every R-project. The intention is to document as much of the project as possible. R Markdown provides a more readable document, with better descriptions of how and why an activity was performed, than a standard R script with a few commented lines.

5.1 Benefits

5.2 Basic Overview

Use markdown syntax, some of which is shown in the table below, to format the document.

Once the document is complete (formatted with markdown syntax with integrated R code), the document can be knit (rendered) using the package knitr.

Here is a simple example showing the raw R Markdown (Rmd) file before knitting (rendering) and after knitting. The colors on the far left are there to help identify elements pre- and post-knitting.

R is not the only language supported by R Markdown. The following languages and more can be integrated into an R Markdown file.

5.3 Create a New Document

  1. Click on the new document button:
  2. Click on R Markdown:
  3. Provide a “Title:”, select the “Default Output Format:”, and click “OK”
  4. A new R Markdown document will appear with some instructions and example text/code. Delete everything after the YAML header:
---
title: "Untitled"
author: "Zachary M. Smith"
date: "September 23, 2018"
output: html_document
---

5.4 Editing

Again, your best resource for learning how to use R Markdown will be the R Markdown website (https://rmarkdown.rstudio.com/lesson-1.html), but I will describe some of the general features here.

5.4.1 YAML Header

YAML: YAML Ain’t Markup Language

5.4.1.1 Standard

5.4.1.2 Table of Contents (TOC)

5.4.1.3 Floating Table of Contents (TOC)
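A table of contents is requested through the output options in the YAML header; toc_float: true keeps the TOC visible as the reader scrolls an HTML document. A minimal sketch of the relevant html_document options:

```yaml
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
```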

5.4.2 Heading Text

Heading text follows one or more hash signs (#). The number of hash signs determines the hierarchy of headings. For example, “# Heading 1” would represent the primary heading, “## Heading 2” would represent the secondary heading, “### Heading 3” would represent the tertiary heading, and so forth.
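For example, the first three heading levels would be written as:

```markdown
# Heading 1
## Heading 2
### Heading 3
```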

5.4.3 Plain Text

Simply add text below the YAML header.

5.4.4 Insert Code Chunks

To insert a code chunk, press Ctrl + Alt + I in the source pane (top left pane in the default settings of RStudio). A code chunk will appear:

Inside the code chunk you can write and run R-code. If you print the output of your R-code it will appear below the code chunk in the source pane and the printed output will appear in the final compiled document. This is useful for producing figures and tables.

5.5 Compile the Document

To view the HTML document, you must compile the document using Knit. Follow these steps to Knit the document:

  1. Find and click the Knit button (it looks like a ball of yarn) in the toolbar above the editor window.

  2. If a window appears saying “Install Required Packages” for R Markdown, install the necessary packages for knitting the document.
  3. The compiled file will be saved in the same directory as your Rmd file (your R Markdown file).

5.6 File Management

I store the R Markdown file(s) in a sub-directory labeled “markdown” within the R-project folder (rproject/markdown).

5.7 Child Documents

In general, I find that a single R Markdown file quickly becomes unwieldy. I recommend breaking the document up into multiple “child” documents and sourcing these child documents in a parent document. My child documents generally represent major subsections of the document.

I store the parent R Markdown file in the “markdown” folder (rproject/markdown) and the child R Markdown files in a sub-directory of my “markdown” folder called “sections” (rproject/markdown/sections). In the parent file, the child files are sourced within the code chunk header using “child = ‘sections/example.Rmd’”. After sourcing all the child chunks, the parent file can be knit (compiled) like a normal R Markdown document. The child documents cannot be run in the parent file.

5.7.1 Extract and Run R-Code from R Markdown Files

The parent file is great for organizing sections of your document, but the child documents cannot be executed within R Studio like a normal code chunk. Without the ability to easily execute the R code within the child documents it can become very difficult to develop new child documents because new child documents often depend on upstream code execution.

Imagine you have a parent document that sources child sections which import your data and clean your data. You now want to visualize your data; accordingly, you begin to develop a visualization child document, which depends on information from the upstream child sections. It would be inefficient and inappropriate to perform all the steps in the upstream child sections within the visualization section. Therefore, you need an effective way to execute the upstream child sections while you continue to develop the visualization section. The inefficient way of doing this is to open each child Rmd file in R Studio and execute them manually in the correct sequence. This becomes tedious after you have three or more documents (imagine doing this for 10+ child sections). The most efficient way that I have found to run upstream child sections is to extract the R-code chunks from each Rmd file, save them in a “raw_scripts” folder, and then source/execute the scripts within a regular R script file (.R).

5.7.1.1 R Code

In this section we establish the file path to the folder that contains all the child documents. The names of the child documents are extracted and stored as a vector. The grepl() function is used to retain only the Rmd files stored in the vector.

sections.path <- "notebook/sections"
r.files.vec <- list.files(sections.path)
r.files.vec <- r.files.vec[grepl("\\.Rmd$", r.files.vec)]

Next, a file path is specified for the R-scripts that will be extracted from the R Markdown documents; I place these files within a “raw_script/extracted” folder. The map() function from the purrr package is used to loop through each file in the vector (r.files.vec). Within the map() loop, the purl() function from knitr is used to extract the R-code from the R Markdown documents and save the code to the specified folder.

extracted.path <- "notebook/raw_script/extracted"
purrr::map(r.files.vec, function(file.i) {
  file.name <- gsub("\\.Rmd$", "", file.i)
  extracted.file <- paste0(file.name, ".R")
  knitr::purl(
    file.path(sections.path, file.i),
    file.path(extracted.path, extracted.file)
  )
})

Finally, create a vector of file names (source.vec) stored in the “raw_script/extracted” folder. You will want to type these out manually (do not use the list.files() function) because in this format you can easily comment out certain scripts and only run the scripts of interest. map() is then used to loop through each specified file in source.vec. Keep in mind that the order of the file names specified in source.vec will determine the order in which these files are executed in the map() function; therefore, order the files in source.vec from furthest upstream to furthest downstream. Each iteration of the loop executes (sources) the specified R-script.

source.vec <- c(
  "import_data.R",
  "clean_data.R",
  "visualize_data.R"
)

purrr::map(source.vec, function(source.i) {
  source(file.path(extracted.path, source.i))
})

Once all the R-scripts extracted from the upstream child R Markdown files have been executed, you can begin or continue work on a new child R Markdown document. I keep all the above code in a single R-script and execute the entire script each time I use this file to make sure all of the files are up-to-date.

6 R Project Exercises

  1. Create an R project
  2. Drag the files from the flash drive into your R project
  3. Run the parent.Rmd within the markdown folder

7 Base R

7.1 Data Types

7.1.1 Numeric (Double)

  • Can include integers and decimal numbers (continuous variables)
  • EXAMPLE:
is.numeric(c(1, 2, 3.5, 6.7))
## [1] TRUE
is.numeric(c(1, 2, 3, 6))
## [1] TRUE
is.numeric(1:5)
## [1] TRUE
is.numeric(c(TRUE, FALSE, TRUE))
## [1] FALSE
is.numeric(c("a", 2, 3, 6))
## [1] FALSE
is.numeric(c("a", "b", "c", "d"))
## [1] FALSE

7.1.2 Integer

  • Only whole numbers (discrete variables)
  • From the is.integer() help file: “is.integer(x) does not test if x contains integer numbers!”
  • EXAMPLE:
is.integer(c(1, 2, 3.5, 6.7))
## [1] FALSE
is.integer(c(1, 2, 3, 6))
## [1] FALSE
is.integer(1:5)
## [1] TRUE
is.integer(c(TRUE, FALSE, TRUE))
## [1] FALSE
is.integer(c("a", 2, 3, 6))
## [1] FALSE
is.integer(c("a", "b", "c", "d"))
## [1] FALSE

7.1.3 Logical

  • TRUE or FALSE values
  • EXAMPLE:
is.logical(c(1, 2, 3.5, 6.7))
## [1] FALSE
is.logical(c(1, 2, 3, 6))
## [1] FALSE
is.logical(1:5)
## [1] FALSE
is.logical(c(TRUE, FALSE, TRUE))
## [1] TRUE
is.logical(c("a", 2, 3, 6))
## [1] FALSE
is.logical(c("a", "b", "c", "d"))
## [1] FALSE

7.1.4 Character

  • Alphanumeric
  • EXAMPLE:
is.character(c(1, 2, 3.5, 6.7))
## [1] FALSE
is.character(c(1, 2, 3, 6))
## [1] FALSE
is.character(1:5)
## [1] FALSE
is.character(c(TRUE, FALSE, TRUE))
## [1] FALSE
is.character(c("a", 2, 3, 6))
## [1] TRUE
is.character(c("a", "b", "c", "d"))
## [1] TRUE

7.1.5 Factor

  • A special integer vector to assign order
  • Use: assign custom order to categorical variables
    • If you were to sort a character vector, R would sort the vector alphabetically.
  • EXAMPLE:
is.factor(c(1, 2, 3.5, 6.7))
## [1] FALSE
is.factor(c(1, 2, 3, 6))
## [1] FALSE
is.factor(1:5)
## [1] FALSE
is.factor(c(TRUE, FALSE, TRUE))
## [1] FALSE
is.factor(c("a", 2, 3, 6))
## [1] FALSE
is.factor(c("a", "b", "c", "d"))
## [1] FALSE

Example of sorting a character vector.

sort(c("a", "b", "c", "d"))
## [1] "a" "b" "c" "d"

Example of sorting a factor vector.

sort(factor(c("a", "b", "c", "d"),
            levels = c("d", "a", "c", "b")))
## [1] d a c b
## Levels: d a c b
as.numeric(sort(factor(
  c("a", "b", "c", "d"),
  levels = c("d", "a", "c", "b")
  )))
## [1] 1 2 3 4

7.1.5.1 Example

In this example there are three stations on the Hudson River (i.e., Port of Albany, West Point, and Pier 84).

Although this is a tidal river, we would generally want to sort and plot this data from upstream to downstream (i.e., Port of Albany, West Point, and Pier 84). The table below shows how sort() would arrange the data if it is stored as a character vs. factor type.
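The difference can be reproduced directly with the three station names:

```r
stations.vec <- c("Pier 84", "Port of Albany", "West Point")

# Character vector: sort() arranges alphabetically
sort(stations.vec)
## [1] "Pier 84"        "Port of Albany" "West Point"

# Factor with levels ordered upstream to downstream
stations.fac <- factor(stations.vec,
                       levels = c("Port of Albany", "West Point", "Pier 84"))
sort(stations.fac)
## [1] Port of Albany West Point     Pier 84
## Levels: Port of Albany West Point Pier 84
```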

8 Data Type Exercises

  1. Create a character vector with 4 elements
  2. Verify that the vector created is character type
  3. Sort the character vector
  4. Convert the character vector to a factor vector and specify levels that are NOT in alphabetical order

9 Data Structures

Common Data Structures:

  • Vectors
    • 1 Dimensional
    • Homogeneous data type
  • Matrices
    • 2 Dimensional
    • Homogeneous data type
  • Arrays
    • Greater than 2 Dimensions
    • Homogeneous data type
  • Data Frames
    • 2 Dimensional
    • Heterogeneous data types
  • Lists
    • Can contain various data structures (vectors, matrices, arrays, and data frames)
    • Heterogeneous data types

9.1 Vectors

  • 1 Dimensional
  • Homogeneous data type

Numeric vector

c(1, 2, 3, 4, 5)
## [1] 1 2 3 4 5

Character Vector

c("A", "B", "C", "D", "E")
## [1] "A" "B" "C" "D" "E"

Logical vector

c(TRUE, FALSE, TRUE, FALSE, TRUE)
## [1]  TRUE FALSE  TRUE FALSE  TRUE

9.2 Matrices

  • 2 Dimensional
  • Homogeneous data type
matrix(c(1:5, 2, 7, 4, 9, 20, 40, 23, 64, 67, 80, 3, 76, 29, 59, 91),
       ncol = 4)
##      [,1] [,2] [,3] [,4]
## [1,]    1    2   40    3
## [2,]    2    7   23   76
## [3,]    3    4   64   29
## [4,]    4    9   67   59
## [5,]    5   20   80   91

9.3 Arrays

  • Greater than 2 Dimensions
  • Homogeneous data type
array(c(c(1:5, 2, 7, 4, 9, 20, 40, 23, 64, 67, 80, 3, 76, 29, 59, 91, 1, 3, 5, 7, 9),
        c(2, 3, 34, 45, 57, 26, 74, 42, 91, 20, 1, 0, 82, 31, 45, 1, 0, 1, 0, 1, 9:5),
        c(1, 0, rep(1, 3), rep(0, 5), 1, 0, 1, 1, 0, rep(1, 5), 0, 0, 1, 1, 0)),
      c(5, 5, 3))
## , , 1
## 
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2   40    3    1
## [2,]    2    7   23   76    3
## [3,]    3    4   64   29    5
## [4,]    4    9   67   59    7
## [5,]    5   20   80   91    9
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2   26    1    1    9
## [2,]    3   74    0    0    8
## [3,]   34   42   82    1    7
## [4,]   45   91   31    0    6
## [5,]   57   20   45    1    5
## 
## , , 3
## 
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    1    1    0
## [2,]    0    0    0    1    0
## [3,]    1    0    1    1    1
## [4,]    1    0    1    1    1
## [5,]    1    0    0    1    0

9.4 Data Frames

  • 2 Dimensional
  • Heterogeneous data types
data.frame(
  Numeric = c(1, 2, 3, 4, 5),
  Character = c("A", "B", "C", "D", "E"),
  Logical = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
##   Numeric Character Logical
## 1       1         A    TRUE
## 2       2         B   FALSE
## 3       3         C    TRUE
## 4       4         D   FALSE
## 5       5         E    TRUE

9.5 Lists

list(
  c(TRUE, FALSE, TRUE, FALSE, TRUE),
  matrix(c("A", "B", "C", "D", "E",
           "AB", "BC", "CD", "DE", "EF",
           rep("BLAH", 5),
           "TEXT", "text", "TEXT", "text", "TEXT"),
         ncol = 4),
  array(c(c(1:5, 2, 7, 4, 9, 20, 40, 23, 64, 67, 80, 3, 76, 29, 59, 91, 1, 3, 5, 7, 9),
        c(2, 3, 34, 45, 57, 26, 74, 42, 91, 20, 1, 0, 82, 31, 45, 1, 0, 1, 0, 1, 9:5),
        c(1, 0, rep(1, 3), rep(0, 5), 1, 0, 1, 1, 0, rep(1, 5), 0, 0, 1, 1, 0)),
      c(5, 5, 3)),
  data.frame(
    Numeric = c(1, 2, 3, 4, 5),
    Character = c("A", "B", "C", "D", "E"),
    Logical = c(TRUE, FALSE, TRUE, FALSE, TRUE),
    stringsAsFactors = FALSE
  )
)
## [[1]]
## [1]  TRUE FALSE  TRUE FALSE  TRUE
## 
## [[2]]
##      [,1] [,2] [,3]   [,4]  
## [1,] "A"  "AB" "BLAH" "TEXT"
## [2,] "B"  "BC" "BLAH" "text"
## [3,] "C"  "CD" "BLAH" "TEXT"
## [4,] "D"  "DE" "BLAH" "text"
## [5,] "E"  "EF" "BLAH" "TEXT"
## 
## [[3]]
## , , 1
## 
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2   40    3    1
## [2,]    2    7   23   76    3
## [3,]    3    4   64   29    5
## [4,]    4    9   67   59    7
## [5,]    5   20   80   91    9
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2   26    1    1    9
## [2,]    3   74    0    0    8
## [3,]   34   42   82    1    7
## [4,]   45   91   31    0    6
## [5,]   57   20   45    1    5
## 
## , , 3
## 
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    1    1    0
## [2,]    0    0    0    1    0
## [3,]    1    0    1    1    1
## [4,]    1    0    1    1    1
## [5,]    1    0    0    1    0
## 
## 
## [[4]]
##   Numeric Character Logical
## 1       1         A    TRUE
## 2       2         B   FALSE
## 3       3         C    TRUE
## 4       4         D   FALSE
## 5       5         E    TRUE

10 Data Structure Manipulation

10.1 Assignment Operator (<-)

The assignment operator (<-) is used to assign data structures to a named object that will be stored in the Environment. The name on the left side of the assignment operator refers to a new or existing data object. The expression on the right side of the assignment operator represents the data that we want to store in that object. When the code is executed, the object will be stored in the Environment, which is visible within the RStudio Environment window.

For example, the integer vector from 1-5, created using c(1:5), is stored with the object name int.vec. The image below this code chunk shows the Environment before and after the code has been executed. We can see that int.vec has been added to the Environment.

int.vec <- c(1:5)

Here is an example of a data frame being stored as an object, example.df. See the “Environment Tab” section to review the helpful tools available in RStudio for viewing elements of a data frame.

example.df <- data.frame(
  Numeric = c(1, 2, 3, 4, 5),
  Character = c("A", "B", "C", "D", "E"),
  Logical = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

10.2 Manipulating Vectors

char.vec <- c("a", "b", "c", "d", "e")
char.vec
## [1] "a" "b" "c" "d" "e"
char.vec[3]
## [1] "c"
char.vec[c(1, 3)]
## [1] "a" "c"

The colon operator can be helpful for grabbing a sequential range. Here I am specifying elements 1, 2, and 3 of char.vec.

char.vec[1:3]
## [1] "a" "b" "c"

10.3 Manipulating Data Frames

This data frame will be used as an example in this section.

example.df <- data.frame(
  Numeric = c(1, 2, 3, 4, 5),
  Character = c("A", "B", "C", "D", "E"),
  Logical = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
knitr::kable(example.df)
Numeric   Character   Logical
-------   ---------   -------
      1   A           TRUE
      2   B           FALSE
      3   C           TRUE
      4   D           FALSE
      5   E           TRUE

When dealing with data frames, think of the square brackets ([, ]) as a coordinate system. Values specified to the left of the comma in [, ] represent rows, while everything to the right of the comma represents columns.

example.df[, 1]
## [1] 1 2 3 4 5
example.df[1, ]
##   Numeric Character Logical
## 1       1         A    TRUE
example.df[1, 1]
## [1] 1
example.df[1:2, 1:2]
##   Numeric Character
## 1       1         A
## 2       2         B
example.df$Numeric
## [1] 1 2 3 4 5
example.df[, "Numeric"]
## [1] 1 2 3 4 5
example.df[example.df$Numeric == 1, ]
##   Numeric Character Logical
## 1       1         A    TRUE
example.df[example.df$Character %in% c("A", "C"), ]
##   Numeric Character Logical
## 1       1         A    TRUE
## 3       3         C    TRUE
example.df[example.df$Character %in% c("A", "C"), "Numeric"]
## [1] 1 3

11 Data Structure Exercises

  1. Store “one”, “two”, “three”, “four”, “five” as a vector object called “char.vec”
  2. Extract just “three” from char.vec
  3. Extract “two” and “four” from char.vec
  4. Show two ways to extract “two”, “three”, and “four”
  5. “iris” is a famous data set built into R. Type and run “iris” within a code chunk
  6. Store iris as a data frame called “iris.df”
  7. Show two ways to extract the column “Petal.Width” from iris.df
  8. Extract the second row from iris.df (should equal: 4.9, 3, 1.4, 0.2, setosa)
  9. Extract the tenth value from Petal.Width in iris.df (0.1)
  10. Subset iris.df to only include rows equal to “versicolor”
  11. Subset iris.df to only include rows that match “setosa” and “virginica”

12 Style Guide

Code style is more important than you may first imagine. Adopting a consistent style will make it easier for you and your collaborators to read and comprehend your code. Please review, and use in future R code, Hadley Wickham’s style guide (http://style.tidyverse.org/index.html). As mentioned in Hadley Wickham’s guide, his guide is adapted from Google’s style guide (https://google.github.io/styleguide/Rguide.xml); therefore, there are many similarities. I do not want to recreate these style guides here; instead I want to highlight what I believe are some of the more important features.

12.1 Names

File names, function names, and column names should not contain spaces. It is very easy to create a name with two consecutive spaces by mistake and very frustrating to later troubleshoot why your call to this name in your R code is returning an error.

I use snake case, “snake_case”, instead of “snake case” or “snakeCase”; the latter, “snakeCase”, is known as camel case. Many programmers use camel case, but I find I make more typos when I use this naming scheme and find snake case easier to read. Please use snake case.
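For illustration (hypothetical object names), compare:

```r
# Good: snake case, no spaces
water_temp_c <- c(10.2, 11.5, 9.8)

# Avoid: camel case, and especially names with spaces (backticks required)
waterTempC <- c(10.2, 11.5, 9.8)
`water temp c` <- c(10.2, 11.5, 9.8)
```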

12.1.1 Object Names: Descriptive Suffix

For object names, I prefer a style similar to the one found in Google’s style guide but with a descriptive suffix. I cannot provide a reference for this style, but at one point I adopted a descriptive suffix that describes the object’s class (e.g., data frame = “.df”, vector = “.vec”, scalar = “.scal”, list = “.lst”, matrix = “.mat”, and time series = “.ts”). I have found this simple naming scheme to be very helpful because I immediately know the intended class of the object at any point it is referenced in the script. This makes it easier to identify a bug if, for example, an object named “object.df” is represented in my RStudio Environment pane as a vector or list.

Examples:

# Data Frame-------------------------------------------------------------------
my.df <- data.frame()

# Vector-----------------------------------------------------------------------
my.vec <- c("a", "b", "c")
my.scal <- "a"

my.vec <- c(1:3)
my.scal <- 1

# Matrix-----------------------------------------------------------------------
my.mat <- matrix()

# List-------------------------------------------------------------------------
my.lst <- list()

# Time Series------------------------------------------------------------------
my.ts <- ts()

12.2 Spacing and Indenting

Please follow the spacing (http://style.tidyverse.org/syntax.html#spacing) and indenting (http://style.tidyverse.org/syntax.html#indenting) guidelines provided in Hadley Wickham’s style guide. I find it very difficult to follow code that does not adhere to these guidelines.

The following “good” and “bad” examples will create the exact same data frame. However, the “good” example is much easier to read and interpret.

Good Example:

good.df <- data.frame(
  alphabet = letters,
  square_root = sqrt(81),
  add = 1 + 1,
  subtract = 1 - 1,
  multiply = 2 * 2,
  divide = 2 / 2,
  power = 2^2
)

Bad Example:

bad.df<-data.frame(alphabet=letters,square_root=sqrt(81),add=1+1,subtract=1-1,multiply=2*2,divide=2/2,power=2^2)

13 Writing Functions

It is highly recommended that you learn to write your own functions. In general, anytime you have the urge to copy and paste code to perform a similar process on a different aspect of your data you should instead write a function. A function is a single standardized process that you apply to similar data sets. Therefore, it is easier to update the function rather than multiple chunks of code that have been copied and pasted throughout your script. For example, imagine you have copied and pasted code ten times. Later you want to update the code. Now you have to update all ten instances of the code. This is time consuming and it would be easy to miss or add a typo to one of the ten instances. If the process is stored as a function, you make one update to the function and that update is applied in a standardized manner to all ten instances of your data.
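As a sketch of this copy-and-paste problem, assume three hypothetical vectors that each need the same centering-and-scaling process:

```r
# Copy-and-paste approach: three chances for a typo and
# three places to update if the process ever changes
x1 <- c(1, 2, 3)
x2 <- c(4, 5, 6)
x3 <- c(7, 8, 9)

x1_scaled <- (x1 - mean(x1)) / sd(x1)
x2_scaled <- (x2 - mean(x2)) / sd(x2)
x3_scaled <- (x3 - mean(x3)) / sd(x3)

# Function approach: the process is defined once and
# updated in a single place
scale_vec <- function(x) {
  (x - mean(x)) / sd(x)
}

x1_scaled <- scale_vec(x1)
x2_scaled <- scale_vec(x2)
x3_scaled <- scale_vec(x3)
```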

The image below provides an overview of the pieces necessary to create a function. Use function() to initialize the creation of a new function. Within function() you will generally add arguments (in the example below: x, y) that you will want to be able to modify. Curly brackets ({}) following the function() call define the extent of the body of the function being created. Within the curly brackets is where most of the work is performed; this is the area where we insert the code that we want to become a standardized process. This example is very simple, just x - y, but you could insert 100+ lines of code into a function. However, really lengthy functions are poor practice. Finally, the function needs to be stored as an object. Use the assignment operator (<-) to assign this function an object name (e.g., subtract <-).

Let us actually create this function.

subtract <- function(x, y) {
  x - y
}

Once this function has been executed, it is stored as an object named “subtract”. This object can be seen in the Environment window.

Now we can test our function.

subtract(x = 10, y = 5)
## [1] 5

We can get the same results without specifically calling out x and y. R will assume that you understand that the order of the arguments is x, y; therefore, 10 will represent x and 5 will represent y.

subtract(10, 5)
## [1] 5

If we reverse the order, then 5 represents x and 10 represents y.

subtract(5, 10)
## [1] -5

Our intention with the created subtract function is to subtract one value from another. However, the code in the chunk below will run. In this code x represents the numeric vector c(1, 2, 3) and y represents the numeric vector c(3, 2, 1).

subtract(c(1, 2, 3), c(3, 2, 1))
## [1] -2  0  2

We only want this function to execute if x and y are each of length one. A conditional if() statement combined with a stop() function can be used to prevent the function from working on data that does not meet our requirements (x of length 1 and y of length 1). Note that this update would normally be made to the original function definition above. subtract() is only redefined here to show the progression of our development of this function.

subtract <- function(x, y) {
  if (length(x) != 1 | length(y) != 1) stop("x and y must be length 1")
  x - y
}

Now if a vector of length 3 is provided for the x or y argument, the function will return an error with the message “x and y must be length 1.” This is the message that was defined in the stop() call within the creation of the subtract() function above.

subtract(c(1, 2, 3), c(3, 2, 1))
## Error in subtract(c(1, 2, 3), c(3, 2, 1)): x and y must be length 1

Double check that the original values still work.

subtract(10, 5)
## [1] 5

14 Function Exercises

  1. Create a function called “mult” with three arguments (x, y, z) that multiplies the three arguments supplied
  2. Test that your mult() function returns 6 if x = 1, y = 2, z = 3
  3. Restrict the function to only work if each of the arguments supplied is length one. Create error messages specific to each argument, which will inform you of which argument failed to meet this requirement.
  4. Test that the messages created in exercise 3 work correctly on:
    • x = c(1, 2), y = 2, z = 3
    • x = 1, y = c(1, 2), z = 3
    • x = 1, y = 2, z = c(1, 2)
  5. Which message will you receive if x = 1, y = c(1, 2), z = c(1, 2)? Why is this the only message that is returned?

  6. Can you make a function that standardizes the three length checks?
    • Call the function “check_length”
    • Hints:
      • Should have 1 argument
      • Check out paste() (?paste)
      • Use substitute() within the paste() call to return the name of the argument supplied (i.e., “x”, “y”, or “z”)
  7. Add check_length() to your mult() function.
  8. Test your updated mult() function on:
    • x = c(1, 2), y = 2, z = 3
    • x = 1, y = c(1, 2), z = 3
    • x = 1, y = 2, z = c(1, 2)
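The substitute() hint in exercise 6 can be seen in isolation with a small, hypothetical function; show_name() below is an illustration only, not part of the exercise solution:

```r
# deparse(substitute(x)) converts the unevaluated argument
# back into the text the caller typed
show_name <- function(x) {
  paste("You supplied the object named:", deparse(substitute(x)))
}

my.vec <- c(1, 2, 3)
show_name(my.vec)
## [1] "You supplied the object named: my.vec"
```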

15 Tidyverse

  • Link: https://www.tidyverse.org/
  • Great Place to Start: https://r4ds.had.co.nz/

The tidyverse is an ecosystem of packages that work well together and often make it easier to deal with data in R. These packages are mostly developed and maintained by RStudio staff but contributions are frequently made by members of the R community.

WARNING: These packages are very helpful but many are not yet stable. There is potential that a function could undergo a large change or be deprecated in the future. This can have a significant negative impact on the reproducibility of your work. The functions in tidyverse packages will generally provide helpful messages indicating if a function has been deprecated and indicate the function that has taken its place.
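As a hedged sketch of where such messages come from, base R’s .Deprecated() produces a similar warning; old_fun() and new_fun() below are hypothetical stand-ins, not real tidyverse functions:

```r
new_fun <- function(x) x * 2

old_fun <- function(x) {
  .Deprecated("new_fun")  # issues a warning naming the replacement
  new_fun(x)
}

# Calling the deprecated function still works, but a warning
# points the user toward new_fun()
old_fun(5)
## [1] 10
```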

Instead of loading each tidyverse package individually, library(tidyverse) will load the most frequently used packages: ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats.

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0       v purrr   0.3.0  
## v tibble  2.0.1       v dplyr   0.8.0.1
## v tidyr   0.8.2       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

15.1 Tidyverse Directory

  • Pipe Operator
  • Data Structure
    • tibble
  • Data Manipulation
  • Loops
    • purrr
  • Visualization
  • Dates and Datetimes
  • Strings
    • stringr
    • glue
  • Factors
    • forcats
  • Import Data
    • readr
    • readxl
    • haven
  • Store Binary Data
    • blob

15.2 magrittr

library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked _by_ '.GlobalEnv':
## 
##     subtract
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract

15.2.1 Example Data

This vector will be used as an example for this section. Notice that all of the strings are slightly different. I frequently find issues like this in data sets I have been given. We will assume that all of the strings are intended to represent the same information, and therefore require some processing to standardize the strings.

x <- c("example 1",
       " example 1",
       "example 1 ", 
       "example_1")

15.2.2 Pipe Operator (%>%)

I often find that the pipe operator, %>%, from the magrittr package is confusing to those unfamiliar with the tidyverse. It takes a little while to wrap your head around the pipe operator but once you do I think you will find its use makes R code more legible. In essence, base R works from the inside-out, while the pipe operator presents the code in a linear fashion.

For example, imagine you have a character vector x and you want to trim leading/trailing white space, then convert all characters to lowercase, and then replace any spaces with underscores. In base R, the code would look like the code in the chunk below. Again, base R works from the inside-out, so first trimws() is used to remove leading/trailing white space, then tolower() is used to convert all characters to lowercase, and finally str_replace() is used to replace any spaces with an underscore.

str_replace(tolower(trimws(x)), " ", "_")
## [1] "example_1" "example_1" "example_1" "example_1"

Using the pipe operator, the code functions just the same but it is formatted in a more legible manner:

  1. x is piped to trimws()
  2. the output of trimws() is piped to tolower()
  3. the output of tolower() is piped to str_replace()
  4. the output of str_replace() is returned

The pipe operator presents the functions in the order you intend them to be performed; therefore, the code should be easier to read and more intuitive, which should in turn reduce errors.

x %>% 
  trimws() %>% 
  tolower() %>% 
  str_replace(" ", "_")
## [1] "example_1" "example_1" "example_1" "example_1"

Most tidyverse packages (but NOT ggplot2) are designed to work well with the pipe operator. The pipe operator pushes the output of the left side of the operator into the first argument of the function on the right side of the operator. Therefore, functions developed outside of the tidyverse may not automatically work well with the pipe operator. A period (.) can be used as a placeholder.

In the examples above, str_replace(), from the stringr package, was used. This function is designed to work well with the pipe operator but what if we wanted to use gsub(), the base R equivalent to str_replace()?

  • str_replace() structure:
    • str_replace(string, pattern, replacement)
  • gsub() structure:
    • gsub(pattern, replacement, x)

In gsub(), the x argument is equivalent to the string argument in str_replace(). Notice the difference in the order of the arguments. In str_replace(), string is the first argument, but in gsub(), x is the last argument. Therefore, when using the pipe operator with gsub(), the placeholder (.) must be used because the output from x %>% trimws() %>% tolower() by default will be piped into the first argument, pattern.

Here is the error received if the placeholder (.) is NOT used:

x %>% 
  trimws() %>% 
  tolower() %>% 
  gsub()
## Error in gsub(.): argument "x" is missing, with no default

Here is how the placeholder (.) would be used in a gsub() call. pattern and replacement are specified as normal, while x = . indicates where the output from x %>% trimws() %>% tolower() should be used.

x %>% 
  trimws() %>% 
  tolower() %>% 
  gsub(pattern = " ",
       replacement = "_",
       x = .)
## [1] "example_1" "example_1" "example_1" "example_1"

16 Pipe Operator Exercises

library(magrittr)
  1. Extract the Species column from iris and pipe it into the function toupper()
  2. Extract just unique strings from the Species column of iris by piping iris into the function unique()
  3. Combine exercises 1 and 2 to return unique strings from the Species column represented as all caps

16.1 dplyr

suppressPackageStartupMessages(
  library(dplyr)
)

16.1.1 Example Data

Import the example data. This data represents benthic macroinvertebrate data collected in the littoral zone of Onondaga, Otisco, and Cazenovia lakes.

taxa.df <- file.path("data",
                     "zms_thesis-macro_2017-06-18.csv") %>% 
  read.csv(stringsAsFactors = FALSE)

DT::datatable(taxa.df, options = list(scrollX = TRUE))
env.df <- file.path("data",
                     "zms_thesis-env_2017-06-18.csv") %>% 
  read.csv(stringsAsFactors = FALSE)

DT::datatable(env.df, options = list(scrollX = TRUE))

16.1.2 Rename

In the example below, the columns lat and long are renamed to latitude and longitude, respectively. If names(taxa.df) is called, elements 5 and 6 are now represented by “latitude” and “longitude”, respectively.

taxa.df <- taxa.df %>% 
  rename(latitude = lat,
         longitude = long)

names(taxa.df)
##  [1] "unique_id"     "lake"          "station_id"    "sample_number"
##  [5] "latitude"      "longitude"     "agency_code"   "method"       
##  [9] "date"          "count"         "life_stage"    "final_id"     
## [13] "taxon_level"   "phylum"        "subphylum"     "class"        
## [17] "subclass"      "order"         "suborder"      "family"       
## [21] "subfamily"     "tribe"         "genus"         "species"      
## [25] "picked"        "squares"

16.1.3 Filter

In the example below, taxa.df is subset to only include rows where the unique_id column matches the string “caz_1_1”.

filter.df <- taxa.df %>% 
  filter(unique_id == "caz_1_1")

DT::datatable(filter.df, options = list(scrollX = TRUE))

You can apply multiple filters separated by commas. The filters are applied from the top down. In the example below, two filters are applied within the filter() call:

  1. unique_id == "caz_1_1"
  2. order %in% c("ephemeroptera", "trichoptera")

The first logical statement is the same as the filter in the code chunk above, which keeps only rows associated with the sample “caz_1_1”. Then, the data are subset using the order column to only include rows represented by ephemeroptera (mayfly) or trichoptera (caddisfly) taxa.

filter.df <- taxa.df %>% 
  filter(unique_id == "caz_1_1",
         order %in% c("ephemeroptera", "trichoptera"))

DT::datatable(filter.df, options = list(scrollX = TRUE))

16.1.4 Select

Often we are not interested in all columns in a data frame. select() can be used to subset the columns to only include columns of interest. In the example below, the filter.df data frame, created in the Filter section, is subset to only include three columns: unique_id, final_id, and count.

select.df <- filter.df %>% 
  select(unique_id, final_id, count)

knitr::kable(select.df)
unique_id final_id count
caz_1_1 baetidae 1
caz_1_1 caenis 14
caz_1_1 agraylea 1
caz_1_1 oxyethira 2
caz_1_1 hydroptilidae 1

The same operation can be performed by chaining the functions together with the pipe operator. taxa.df is first filtered to only include rows that meet our specified logical statements and then only the columns unique_id, final_id, and count are retained.

select.df <- taxa.df %>%
  filter(unique_id == "caz_1_1",
         order %in% c("ephemeroptera", "trichoptera")) %>% 
  select(unique_id, final_id, count)

knitr::kable(select.df)
unique_id final_id count
caz_1_1 baetidae 1
caz_1_1 caenis 14
caz_1_1 agraylea 1
caz_1_1 oxyethira 2
caz_1_1 hydroptilidae 1

Reorder columns. Maybe you would prefer to see the columns in the following order:

  1. final_id
  2. unique_id
  3. count
reorder.df <- taxa.df %>%
  filter(unique_id == "caz_1_1",
         order %in% c("ephemeroptera", "trichoptera")) %>% 
  select(final_id, unique_id, count)

knitr::kable(reorder.df)
final_id unique_id count
baetidae caz_1_1 1
caenis caz_1_1 14
agraylea caz_1_1 1
oxyethira caz_1_1 2
hydroptilidae caz_1_1 1

16.1.4.1 everything

everything() makes it easy to reorder a few columns while leaving the rest in place. For example, maybe we want the first three columns reordered but the remaining columns to stay in their original order. Without everything(), we would need to specify each column in the select() call. With everything(), we can specify the first three columns as lake, station_id, and sample_number, followed by everything(). The columns will then be reordered accordingly (lake, station_id, sample_number, unique_id, etc.).

reorder.df <- taxa.df %>%
  filter(unique_id == "caz_1_1",
         order %in% c("ephemeroptera", "trichoptera")) %>% 
  select(lake, station_id, sample_number, everything())

DT::datatable(reorder.df, options = list(scrollX = TRUE))

16.1.5 distinct

After subsetting columns with select(), it may be beneficial to subsequently run distinct() to remove any duplicate rows.

Maybe we are just interested in viewing basic sample information such as lake, station_id, and sample_number. If select() is used to subset the data frame to only these columns, there will be many duplicate rows. The reason for this is that in the original taxa.df data frame there are multiple taxa observed per sample. lake, station_id, and sample_number are associated with each taxon, and therefore lake, station_id, and sample_number are represented many times.

nondistinct.df <- taxa.df %>% 
  select(lake, station_id, sample_number)

DT::datatable(nondistinct.df, 
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))

distinct() will remove the duplicate rows present in the output above. This is a very simple call, which requires no input.

distinct.df <- taxa.df %>% 
  select(lake, station_id, sample_number) %>% 
  distinct()

DT::datatable(distinct.df, 
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))

16.1.6 mutate

In the example below, I use the select() and distinct() combination from the distinct section and then filter the data frame to only include rows where lake equals “caz”. The distinct() call has ensured that each row is unique, but there is not a single column that can be used to uniquely identify a given sample. Using mutate() we can create a new column, called new_id, that concatenates the station_id and sample_number columns into a single value separated by an underscore (paste(station_id, sample_number, sep = "_")).

mutate.df <- taxa.df %>% 
  select(lake, station_id, sample_number) %>% 
  distinct() %>%
  filter(lake == "caz") %>% 
  mutate(new_id = paste(station_id, sample_number,
                        sep = "_"))

DT::datatable(mutate.df,
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:4))))

Maybe a second type of unique identifier is wanted for certain subsequent processes. mutate() allows for the creation of one or more columns within a single call. In the example below, new_id is created as it was in the example above but is followed by the creation of date_id. date_id is created by concatenating new_id and date columns. Therefore, a single mutate() call has created two new columns. Additionally, we can see that columns created downstream (date_id) can reference columns created upstream (new_id) within the mutate() call.

mutate.df <- taxa.df %>% 
  select(lake, station_id, sample_number, date) %>% 
  distinct() %>%
  filter(lake == "caz") %>% 
  mutate(new_id = paste(station_id, sample_number,
                        sep = "_"),
         date_id = paste(new_id, date,
                         sep = "_"))

DT::datatable(mutate.df,
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:6))))

16.1.7 group_by

Calling group_by() will aggregate the data by the column(s) specified but alone will not alter the content of the data frame.

group.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, final_id, count) %>% 
  group_by(unique_id)

knitr::kable(group.df)
unique_id final_id count
caz_1_1 physa 1
caz_1_1 gyraulus 1
caz_1_1 bezzia 5
caz_1_1 mallochohelea 7
caz_1_1 chironomus 1
caz_1_1 dicrotendipes 3
caz_1_1 paratanytarsus 34
caz_1_1 cricotopus 1
caz_1_1 psectrocladius 5
caz_1_1 thienemanniella 4
caz_1_1 krenopelopia 9
caz_1_1 procladius 1
caz_1_1 hemerodromia 1
caz_1_1 baetidae 1
caz_1_1 caenis 14
caz_1_1 acentria 1
caz_1_1 enallagma 1
caz_1_1 agraylea 1
caz_1_1 oxyethira 2
caz_1_1 hydroptilidae 1
caz_1_1 gammarus 2
caz_1_1 hyalella 14
caz_1_1 caecidotea 2
caz_1_2 amnicola_grana 2
caz_1_2 physa 27
caz_1_2 gyraulus_parvus 1
caz_1_2 helisoma_anceps 2
caz_1_2 promenetus_exacuous 5
caz_1_2 valvata 1
caz_1_2 bezzia 9
caz_1_2 mallochohelea 2
caz_1_2 chironomus 5
caz_1_2 dicrotendipes 4
caz_1_2 microtendipes 1
caz_1_2 polypedilum 2
caz_1_2 pseudochironomus 1
caz_1_2 rheotanytarsus 2
caz_1_2 thienemanniella 1
caz_1_2 ablabesmyia 5
caz_1_2 callibaetis 1
caz_1_2 caenis 1
caz_1_2 caenidae 2
caz_1_2 acentria 1
caz_1_2 mystacides 2
caz_1_2 gammarus 1
caz_1_2 hyalella 13
caz_1_2 caecidotea 9

The image below shows “under the hood” information about group.df from the [Environment Tab] before and after the group_by() call is made. We do not need to focus on the details of this information, but it lets us view the change made by the group_by() call; the table shown above, in contrast, provides no indication that the data have been aggregated.
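If the image is unavailable, the same grouping change can be inspected directly in code; this is a minimal sketch using a small, hypothetical tibble rather than group.df:

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b"), value = 1:3)

grouped.df <- df %>%
  group_by(g)

# class() reveals the extra "grouped_df" class added by group_by()
class(grouped.df)

# group_vars() lists the grouping column(s)
group_vars(grouped.df)
## [1] "g"
```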

We can follow group_by() with a mutate() call to calculate a single value per group, which is then replicated for each row within the group. In the example below, we are aggregating by two unique_ids (group_by(unique_id); “caz_1_1” and “caz_1_2”) and finding the total number of organisms identified within each group (mutate(total = sum(count))). The sum() function returns one number per group. mutate() then replicates this one number for all rows within a group.

group.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, final_id, count) %>% 
  group_by(unique_id) %>% 
  mutate(total = sum(count))

knitr::kable(group.df)
unique_id final_id count total
caz_1_1 physa 1 112
caz_1_1 gyraulus 1 112
caz_1_1 bezzia 5 112
caz_1_1 mallochohelea 7 112
caz_1_1 chironomus 1 112
caz_1_1 dicrotendipes 3 112
caz_1_1 paratanytarsus 34 112
caz_1_1 cricotopus 1 112
caz_1_1 psectrocladius 5 112
caz_1_1 thienemanniella 4 112
caz_1_1 krenopelopia 9 112
caz_1_1 procladius 1 112
caz_1_1 hemerodromia 1 112
caz_1_1 baetidae 1 112
caz_1_1 caenis 14 112
caz_1_1 acentria 1 112
caz_1_1 enallagma 1 112
caz_1_1 agraylea 1 112
caz_1_1 oxyethira 2 112
caz_1_1 hydroptilidae 1 112
caz_1_1 gammarus 2 112
caz_1_1 hyalella 14 112
caz_1_1 caecidotea 2 112
caz_1_2 amnicola_grana 2 100
caz_1_2 physa 27 100
caz_1_2 gyraulus_parvus 1 100
caz_1_2 helisoma_anceps 2 100
caz_1_2 promenetus_exacuous 5 100
caz_1_2 valvata 1 100
caz_1_2 bezzia 9 100
caz_1_2 mallochohelea 2 100
caz_1_2 chironomus 5 100
caz_1_2 dicrotendipes 4 100
caz_1_2 microtendipes 1 100
caz_1_2 polypedilum 2 100
caz_1_2 pseudochironomus 1 100
caz_1_2 rheotanytarsus 2 100
caz_1_2 thienemanniella 1 100
caz_1_2 ablabesmyia 5 100
caz_1_2 callibaetis 1 100
caz_1_2 caenis 1 100
caz_1_2 caenidae 2 100
caz_1_2 acentria 1 100
caz_1_2 mystacides 2 100
caz_1_2 gammarus 1 100
caz_1_2 hyalella 13 100
caz_1_2 caecidotea 9 100

This example can be taken one step further to calculate relative abundance (i.e., the percentage of the sample represented by each taxon) by adding percent = count / total * 100 to the mutate() call.

group.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, final_id, count) %>% 
  group_by(unique_id) %>% 
  mutate(total = sum(count),
         percent = count / total * 100)

knitr::kable(group.df)
unique_id final_id count total percent
caz_1_1 physa 1 112 0.8928571
caz_1_1 gyraulus 1 112 0.8928571
caz_1_1 bezzia 5 112 4.4642857
caz_1_1 mallochohelea 7 112 6.2500000
caz_1_1 chironomus 1 112 0.8928571
caz_1_1 dicrotendipes 3 112 2.6785714
caz_1_1 paratanytarsus 34 112 30.3571429
caz_1_1 cricotopus 1 112 0.8928571
caz_1_1 psectrocladius 5 112 4.4642857
caz_1_1 thienemanniella 4 112 3.5714286
caz_1_1 krenopelopia 9 112 8.0357143
caz_1_1 procladius 1 112 0.8928571
caz_1_1 hemerodromia 1 112 0.8928571
caz_1_1 baetidae 1 112 0.8928571
caz_1_1 caenis 14 112 12.5000000
caz_1_1 acentria 1 112 0.8928571
caz_1_1 enallagma 1 112 0.8928571
caz_1_1 agraylea 1 112 0.8928571
caz_1_1 oxyethira 2 112 1.7857143
caz_1_1 hydroptilidae 1 112 0.8928571
caz_1_1 gammarus 2 112 1.7857143
caz_1_1 hyalella 14 112 12.5000000
caz_1_1 caecidotea 2 112 1.7857143
caz_1_2 amnicola_grana 2 100 2.0000000
caz_1_2 physa 27 100 27.0000000
caz_1_2 gyraulus_parvus 1 100 1.0000000
caz_1_2 helisoma_anceps 2 100 2.0000000
caz_1_2 promenetus_exacuous 5 100 5.0000000
caz_1_2 valvata 1 100 1.0000000
caz_1_2 bezzia 9 100 9.0000000
caz_1_2 mallochohelea 2 100 2.0000000
caz_1_2 chironomus 5 100 5.0000000
caz_1_2 dicrotendipes 4 100 4.0000000
caz_1_2 microtendipes 1 100 1.0000000
caz_1_2 polypedilum 2 100 2.0000000
caz_1_2 pseudochironomus 1 100 1.0000000
caz_1_2 rheotanytarsus 2 100 2.0000000
caz_1_2 thienemanniella 1 100 1.0000000
caz_1_2 ablabesmyia 5 100 5.0000000
caz_1_2 callibaetis 1 100 1.0000000
caz_1_2 caenis 1 100 1.0000000
caz_1_2 caenidae 2 100 2.0000000
caz_1_2 acentria 1 100 1.0000000
caz_1_2 mystacides 2 100 2.0000000
caz_1_2 gammarus 1 100 1.0000000
caz_1_2 hyalella 13 100 13.0000000
caz_1_2 caecidotea 9 100 9.0000000

16.1.7.1 ungroup

Once the calculations that required aggregation (i.e., group_by()) are complete, make sure to remove the aggregation from the data frame using ungroup(). If you forget to do this, R will continue to try to apply the aggregation to future calculations, which will most likely be inappropriate. Generally, you will eventually get an error message if you forget to use ungroup().

ungroup.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, final_id, count) %>% 
  group_by(unique_id) %>% 
  mutate(total = sum(count),
         percent = count / total * 100) %>% 
  ungroup()

knitr::kable(ungroup.df)
unique_id final_id count total percent
caz_1_1 physa 1 112 0.8928571
caz_1_1 gyraulus 1 112 0.8928571
caz_1_1 bezzia 5 112 4.4642857
caz_1_1 mallochohelea 7 112 6.2500000
caz_1_1 chironomus 1 112 0.8928571
caz_1_1 dicrotendipes 3 112 2.6785714
caz_1_1 paratanytarsus 34 112 30.3571429
caz_1_1 cricotopus 1 112 0.8928571
caz_1_1 psectrocladius 5 112 4.4642857
caz_1_1 thienemanniella 4 112 3.5714286
caz_1_1 krenopelopia 9 112 8.0357143
caz_1_1 procladius 1 112 0.8928571
caz_1_1 hemerodromia 1 112 0.8928571
caz_1_1 baetidae 1 112 0.8928571
caz_1_1 caenis 14 112 12.5000000
caz_1_1 acentria 1 112 0.8928571
caz_1_1 enallagma 1 112 0.8928571
caz_1_1 agraylea 1 112 0.8928571
caz_1_1 oxyethira 2 112 1.7857143
caz_1_1 hydroptilidae 1 112 0.8928571
caz_1_1 gammarus 2 112 1.7857143
caz_1_1 hyalella 14 112 12.5000000
caz_1_1 caecidotea 2 112 1.7857143
caz_1_2 amnicola_grana 2 100 2.0000000
caz_1_2 physa 27 100 27.0000000
caz_1_2 gyraulus_parvus 1 100 1.0000000
caz_1_2 helisoma_anceps 2 100 2.0000000
caz_1_2 promenetus_exacuous 5 100 5.0000000
caz_1_2 valvata 1 100 1.0000000
caz_1_2 bezzia 9 100 9.0000000
caz_1_2 mallochohelea 2 100 2.0000000
caz_1_2 chironomus 5 100 5.0000000
caz_1_2 dicrotendipes 4 100 4.0000000
caz_1_2 microtendipes 1 100 1.0000000
caz_1_2 polypedilum 2 100 2.0000000
caz_1_2 pseudochironomus 1 100 1.0000000
caz_1_2 rheotanytarsus 2 100 2.0000000
caz_1_2 thienemanniella 1 100 1.0000000
caz_1_2 ablabesmyia 5 100 5.0000000
caz_1_2 callibaetis 1 100 1.0000000
caz_1_2 caenis 1 100 1.0000000
caz_1_2 caenidae 2 100 2.0000000
caz_1_2 acentria 1 100 1.0000000
caz_1_2 mystacides 2 100 2.0000000
caz_1_2 gammarus 1 100 1.0000000
caz_1_2 hyalella 13 100 13.0000000
caz_1_2 caecidotea 9 100 9.0000000

The image below shows “under the hood” information about ungroup.df from the [Environment Tab] before and after the ungroup() call is made. We do not need to focus on the details of this information, but it lets us view the change made by the ungroup() call; the table shown above, in contrast, provides no indication that the data have been unaggregated.
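A minimal sketch of the pitfall described above, using a small, hypothetical data frame; notice that a calculation made after group_by(), without ungroup(), is still computed per group:

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b"), value = c(1, 2, 3))

# Without ungroup(): grand_total is computed per group (likely not intended)
df %>%
  group_by(g) %>%
  mutate(group_total = sum(value)) %>%
  mutate(grand_total = sum(value))   # still grouped: 3, 3, 3

# With ungroup(): grand_total is computed over all rows
df %>%
  group_by(g) %>%
  mutate(group_total = sum(value)) %>%
  ungroup() %>%
  mutate(grand_total = sum(value))   # 6, 6, 6
```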

16.1.8 summarize

  • Definition: creates or overwrites columns but only retains columns specified in group_by() and the column(s) created in the summarize() call. This function requires group_by() to be called in advance.
  • Link: https://dplyr.tidyverse.org/reference/summarise.html

In the group_by example, we calculated the percentage of the sample comprised by each final_id taxon. final_id represents the highest taxonomic resolution for which these taxa were identified; therefore, each row represented a unique taxon.

What if we want to explore these samples at a lower resolution? In the example below, the taxa are represented at the order-level. The percent calculations are correct but, in many cases, the same taxon is represented more than once. This is a poor representation of the data. summarize() can help us solve this issue in the subsequent code chunks.

sub.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, order, count) %>% 
  group_by(unique_id)  %>% 
  mutate(total = sum(count),
         percent = count / total * 100) %>% 
  ungroup()

knitr::kable(sub.df)
unique_id order count total percent
caz_1_1 basommatophora 1 112 0.8928571
caz_1_1 basommatophora 1 112 0.8928571
caz_1_1 diptera 5 112 4.4642857
caz_1_1 diptera 7 112 6.2500000
caz_1_1 diptera 1 112 0.8928571
caz_1_1 diptera 3 112 2.6785714
caz_1_1 diptera 34 112 30.3571429
caz_1_1 diptera 1 112 0.8928571
caz_1_1 diptera 5 112 4.4642857
caz_1_1 diptera 4 112 3.5714286
caz_1_1 diptera 9 112 8.0357143
caz_1_1 diptera 1 112 0.8928571
caz_1_1 diptera 1 112 0.8928571
caz_1_1 ephemeroptera 1 112 0.8928571
caz_1_1 ephemeroptera 14 112 12.5000000
caz_1_1 lepidoptera 1 112 0.8928571
caz_1_1 odonata 1 112 0.8928571
caz_1_1 trichoptera 1 112 0.8928571
caz_1_1 trichoptera 2 112 1.7857143
caz_1_1 trichoptera 1 112 0.8928571
caz_1_1 amphipoda 2 112 1.7857143
caz_1_1 amphipoda 14 112 12.5000000
caz_1_1 isopoda 2 112 1.7857143
caz_1_2 neotaenioglossa 2 100 2.0000000
caz_1_2 basommatophora 27 100 27.0000000
caz_1_2 basommatophora 1 100 1.0000000
caz_1_2 basommatophora 2 100 2.0000000
caz_1_2 basommatophora 5 100 5.0000000
caz_1_2 heterostropha 1 100 1.0000000
caz_1_2 diptera 9 100 9.0000000
caz_1_2 diptera 2 100 2.0000000
caz_1_2 diptera 5 100 5.0000000
caz_1_2 diptera 4 100 4.0000000
caz_1_2 diptera 1 100 1.0000000
caz_1_2 diptera 2 100 2.0000000
caz_1_2 diptera 1 100 1.0000000
caz_1_2 diptera 2 100 2.0000000
caz_1_2 diptera 1 100 1.0000000
caz_1_2 diptera 5 100 5.0000000
caz_1_2 ephemeroptera 1 100 1.0000000
caz_1_2 ephemeroptera 1 100 1.0000000
caz_1_2 ephemeroptera 2 100 2.0000000
caz_1_2 lepidoptera 1 100 1.0000000
caz_1_2 trichoptera 2 100 2.0000000
caz_1_2 amphipoda 1 100 1.0000000
caz_1_2 amphipoda 13 100 13.0000000
caz_1_2 isopoda 9 100 9.0000000

The most intuitive way to solve this problem is to sum the counts by the unique_id and order columns before calculating the percentages, which can be done with a combination of group_by() and summarize().

  1. Apply group_by() to the unique_id and order columns.
    • group_by(unique_id, order)
  2. Use summarize() in the same way mutate() was applied in the mutate section. In this case, the count will be overwritten by the sum of the counts aggregated by the unique_id and order columns.
    • summarize(count = sum(count))
  3. Use ungroup() to remove the groupings created with group_by().

In the table output, we can see that each taxon, order, is only represented once per sample, unique_id.

summarize.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, order, count) %>% 
  group_by(unique_id, order) %>% 
  summarize(count = sum(count)) %>% 
  ungroup()

knitr::kable(summarize.df)
unique_id order count
caz_1_1 amphipoda 16
caz_1_1 basommatophora 2
caz_1_1 diptera 71
caz_1_1 ephemeroptera 15
caz_1_1 isopoda 2
caz_1_1 lepidoptera 1
caz_1_1 odonata 1
caz_1_1 trichoptera 4
caz_1_2 amphipoda 14
caz_1_2 basommatophora 35
caz_1_2 diptera 32
caz_1_2 ephemeroptera 4
caz_1_2 heterostropha 1
caz_1_2 isopoda 9
caz_1_2 lepidoptera 1
caz_1_2 neotaenioglossa 2
caz_1_2 trichoptera 2

To calculate the percentage of each order, apply the group_by(), mutate(), and ungroup() combination from the group_by section.

summarize.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, order, count) %>% 
  group_by(unique_id, order) %>% 
  summarize(count = sum(count)) %>% 
  ungroup() %>% 
  group_by(unique_id) %>% 
  mutate(total = sum(count),
         percent = count / total * 100) %>% 
  ungroup()

knitr::kable(summarize.df)
unique_id order count total percent
caz_1_1 amphipoda 16 112 14.2857143
caz_1_1 basommatophora 2 112 1.7857143
caz_1_1 diptera 71 112 63.3928571
caz_1_1 ephemeroptera 15 112 13.3928571
caz_1_1 isopoda 2 112 1.7857143
caz_1_1 lepidoptera 1 112 0.8928571
caz_1_1 odonata 1 112 0.8928571
caz_1_1 trichoptera 4 112 3.5714286
caz_1_2 amphipoda 14 100 14.0000000
caz_1_2 basommatophora 35 100 35.0000000
caz_1_2 diptera 32 100 32.0000000
caz_1_2 ephemeroptera 4 100 4.0000000
caz_1_2 heterostropha 1 100 1.0000000
caz_1_2 isopoda 9 100 9.0000000
caz_1_2 lepidoptera 1 100 1.0000000
caz_1_2 neotaenioglossa 2 100 2.0000000
caz_1_2 trichoptera 2 100 2.0000000

The previous code chunk works, but it is getting very long. Long code chunks are harder for your future self or others to interpret and can make bugs harder to find and troubleshoot. Better practice would be to break the previous code chunk into 2-3 sub-tasks.

The following code chunk is broken into 3 sub-tasks, which should make the process easier to follow and allow us to observe the data after each sub-task is executed.

Explanation of sub-tasks below:

  1. caz1.df: reduce the data frame down to only the rows and columns of interest. In this case, we are only interested in the two replicates from Cazenovia Lake site 1; therefore, I named the data frame caz1.df to describe the data represented within the data frame.
  2. order.df: summarize the taxonomic counts by the sample, unique_id, and taxonomic rank, order.
  3. percent.df: calculate the percentage of each sample, unique_id, represented by each taxon, order.

caz1.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, order, count)

order.df <- caz1.df %>% 
  group_by(unique_id, order) %>% 
  summarize(count = sum(count)) %>% 
  ungroup()

percent.df <- order.df %>% 
  group_by(unique_id) %>% 
  mutate(total = sum(count),
         percent = count / total * 100) %>% 
  ungroup()

16.1.9 Joins

Link: https://dplyr.tidyverse.org/reference/join.html

  • left_join
  • right_join
  • full_join
  • inner_join
  • semi_join
  • anti_join
  • nest_join
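Before walking through the dplyr examples, here is a self-contained sketch of the join behaviors using base R's merge(), which dplyr's joins parallel (the toy data frames x and y are hypothetical):

```r
# Toy data frames (hypothetical) to illustrate join behavior.
x <- data.frame(unique_id = c("a", "b"), count = c(1, 2))
y <- data.frame(unique_id = c("b", "c"), depth = c(3, 4))

left  <- merge(x, y, all.x = TRUE)  # like left_join: keeps a and b; depth is NA for a
right <- merge(x, y, all.y = TRUE)  # like right_join: keeps b and c; count is NA for c
full  <- merge(x, y, all = TRUE)    # like full_join: keeps a, b, and c
inner <- merge(x, y)                # like inner_join: keeps only b
```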

16.1.9.1 left_join

left_join() returns every row of taxa.df (the left data frame) and appends the matching columns from env.df; rows without a match in env.df are filled with NA.

left.df <- left_join(taxa.df, env.df, by = "unique_id")

DT::datatable(left.df, options = list(scrollX = TRUE))

16.1.9.2 right_join

right_join() instead keeps every row of env.df (the right data frame) and appends the matching columns from taxa.df.

right.df <- right_join(taxa.df, env.df, by = "unique_id")

DT::datatable(right.df, options = list(scrollX = TRUE))

16.1.9.3 full_join

full_join() keeps every row from both data frames, filling in NA wherever one side has no match.

full.df <- taxa.df %>% 
  filter(lake == "onon") %>% 
  full_join(env.df, by = "unique_id")

DT::datatable(full.df, options = list(scrollX = TRUE))

16.1.9.4 semi_join

semi_join() acts as a filter: it returns the rows of the left data frame that have a match in env.df, without adding any columns from env.df.

semi.df <- taxa.df %>% 
  filter(lake == "onon") %>% 
  semi_join(env.df, by = "unique_id")

DT::datatable(semi.df, options = list(scrollX = TRUE))

16.1.9.5 anti_join

anti_join() is the complement of semi_join(): it returns the rows of env.df that have no match in sub.df.

sub.df <- taxa.df %>% 
  filter(lake == "onon")

anti.df <- anti_join(env.df, sub.df, by = "unique_id")

DT::datatable(anti.df, options = list(scrollX = TRUE))

16.1.9.6 nest_join

nest_join() returns every row of the left data frame with the matching rows of env.df stored as a nested data frame in a list-column.

nest.df <- taxa.df %>% 
  filter(lake == "onon") %>% 
  nest_join(env.df, by = "unique_id")

DT::datatable(nest.df, options = list(scrollX = TRUE))

16.1.10 Bind

Link: https://dplyr.tidyverse.org/reference/bind.html

16.1.10.1 bind_rows

bind_rows() stacks data frames vertically, matching columns by name; columns missing from one data frame are filled with NA.

onon.df <- taxa.df %>% 
  filter(lake == "onon")

otis.df <- taxa.df %>% 
  filter(lake == "ot")

caz.df <- taxa.df %>% 
  filter(lake == "caz")

bind.df <- bind_rows(onon.df, otis.df)

DT::datatable(bind.df, options = list(scrollX = TRUE))

16.1.10.2 bind_cols

bind_cols() places data frames side-by-side; each data frame must have the same number of rows, and the rows must already be in matching order.

hier.df <- onon.df %>% 
  select(phylum:species)

DT::datatable(hier.df, options = list(scrollX = TRUE))

count.df <- onon.df %>% 
  select(unique_id, final_id, count)

DT::datatable(count.df, options = list(scrollX = TRUE))

bind.df <- bind_cols(count.df, hier.df)

DT::datatable(bind.df, options = list(scrollX = TRUE))
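As a rough base-R analogy (a sketch with hypothetical toy data; note that bind_rows() is more forgiving than rbind() because it matches columns by name and fills missing columns with NA):

```r
a <- data.frame(id = 1:2, value = c("x", "y"))
b <- data.frame(id = 3:4, value = c("z", "w"))

stacked <- rbind(a, b)       # like bind_rows: 4 rows, same columns
side_by_side <- cbind(a, b)  # like bind_cols: 2 rows, columns from both
```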

16.2 tidyr

library(tidyr)
suppressPackageStartupMessages(
 library(dplyr) 
)

16.2.1 Example Data

Import the example data. This data represents benthic macroinvertebrate data collected in the littoral zone of Onondaga, Otisco, and Cazenovia lakes.

taxa.df <- file.path("data",
          "zms_thesis-macro_2017-06-18.csv") %>% 
  read.csv(stringsAsFactors = FALSE)

Preprocess taxa.df to just represent order-level taxonomic counts per sample. For more details about this process see the summarize section.

ord.df <- taxa.df %>% 
  select(unique_id, order, count) %>% 
  group_by(unique_id, order) %>% 
  summarize(count = sum(count)) %>% 
  ungroup()

DT::datatable(ord.df, options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))

16.2.2 spread

In some instances it might be beneficial to reshape a data frame from a long format to a wide format, which can be done simply with spread(). spread() requires two columns to be specified:

  1. key: a column name, for which the unique values in the column will become column headers.
  2. value: a column name, for which the values will be represented in the rows appropriately associated with the key column.

The remaining columns will be used as an anchor point to represent a unique key for each row. If these remaining columns are not unique, an error will be returned. You will need to figure out why there are duplicate rows in the unique identifier columns and how to remedy the issue.
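A quick way to track down such duplicates is to check the key columns before spreading. A minimal sketch with a hypothetical data frame:

```r
# Hypothetical long-format data with a duplicated key combination.
df <- data.frame(unique_id = c("caz_1_1", "caz_1_1", "caz_1_2"),
                 order = c("diptera", "diptera", "diptera"),
                 count = c(5, 7, 9))

# Rows whose identifier columns are duplicated would make spread() fail.
dup_rows <- df[duplicated(df[c("unique_id", "order")]), ]
nrow(dup_rows)  # one duplicated key combination
```

Summing the counts over the duplicated keys (as in the summarize section) is the usual remedy.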

In the example below, ord.df is transformed from a long to a wide format. order values now represent column headers and the values from the count column have been filled in appropriately under the associated new order column headers.

wide.df <- ord.df %>% 
  spread(order, count)

DT::datatable(wide.df, options = list(scrollX = TRUE))

In the example above, any instance where a taxon was not found in a sample is represented as an NA (shown as a blank space in the output of DT::datatable()). However, in this example it would be more appropriate to represent all of these values as zeros. To do this we just need to specify fill = 0 within the spread() call. In the table below, there are no NAs.

wide.df <- ord.df %>% 
  spread(order, count, fill = 0)

DT::datatable(wide.df, options = list(scrollX = TRUE))

16.2.3 gather

In most instances, packages from the tidyverse are designed to operate on data in a long format. gather() makes it simple to convert a wide format to a long format.

In this code chunk, wide.df from the spread section is converted from a wide to a long format using gather(). order is a new column name, which by default will hold all of the column names from wide.df. count is also a new column name, which will hold the values that were present in each column. If just gather(order, count) is applied, unique_id is treated like any other column: its name is included in the order column and its values are included in the count column. This is not correct. The next code chunk will correct this issue.

long.df <- wide.df %>% 
  gather(order, count)

DT::datatable(long.df,
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:2))))

Adding -unique_id to the end of the gather() call will retain unique_id in the final output but exclude this column from being included in the conversion of the remaining headers to the order column and the remaining row values to the count column. One or more columns can be excluded following the -unique_id example.

long.df <- wide.df %>% 
  gather(order, count, -unique_id)

DT::datatable(long.df,
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))

16.2.4 separate

In some instances, a data frame may contain a column that has concatenated information that is separated by a common character or pattern. It may be beneficial to extract this concatenated information into separate columns to make it easier to perform subsequent tasks, such as filtering. separate() can be used to make this a simple task.

This example uses long.df from the gather section. In the separate() call:

  1. unique_id refers to the name of the column to be split
  2. c("lake", "station_id", "replicate") refers to the new column names that the split values from unique_id will fill
  3. sep = "_" specifies that the strings in unique_id should be split by underscores

The default of separate() is to remove (remove = TRUE) the original column unique_id from the returned data frame.
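Under the hood this is ordinary string splitting; base R's strsplit() shows the pieces that separate() spreads into columns:

```r
# Split one unique_id on underscores.
strsplit("caz_1_1", "_")[[1]]
# "caz" "1" "1" — the values that fill lake, station_id, and replicate
```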

sep.df <- long.df %>% 
  separate(unique_id, c("lake", "station_id", "replicate"), sep = "_")

DT::datatable(sep.df,
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:5))))

I often find it helpful to specify remove = FALSE, which, in this example, will split unique_id into several new columns but also retain unique_id in the final output.

sep.df <- long.df %>% 
  separate(unique_id,
           c("lake", "station_id", "replicate"),
           sep = "_",
           remove = FALSE)

DT::datatable(sep.df,
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:6))))

16.2.5 unite

unite() is the opposite of separate(): it combines values from multiple columns into a single string separated by a common character or pattern.

Using sep.df, created in the separate section, columns lake, station_id, and replicate can be concatenated into a single string within a new column using unite().

In the unite() call:

  1. "unique_id2" refers to the new column name that will contain the concatenated strings from the subsequently specified columns
  2. c("lake", "station_id", "replicate") refers to the column names that contain the values that we want to be concatenated in unique_id2.
  3. sep = "-" specifies that the values from c("lake", "station_id", "replicate") should be separated in unique_id2 by a hyphen

The default of unite() is to remove (remove = TRUE) the concatenated columns (c("lake", "station_id", "replicate")) from the returned data frame.
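The concatenation itself is what base R's paste() does for a single row:

```r
# Combine the three pieces with a hyphen, as unite() does row-wise.
paste("caz", "1", "1", sep = "-")
# "caz-1-1"
```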

unite.df <- sep.df %>% 
  unite("unique_id2",
        c("lake", "station_id", "replicate"),
        sep = "-")

DT::datatable(unite.df,
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:4))))

In some cases I find it useful to specify remove = FALSE, which, in this example, will retain lake, station_id, and replicate in the final output.

unite.df <- sep.df %>% 
  unite("unique_id2",
        c("lake", "station_id", "replicate"),
        sep = "-",
        remove = FALSE)

DT::datatable(unite.df,
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:7))))

16.3 lubridate

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
suppressPackageStartupMessages(
 library(dplyr) 
)

16.3.1 Example Data

Import the example data. This data represents benthic macroinvertebrate data collected in the littoral zone of Onondaga, Otisco, and Cazenovia lakes.

taxa.df <- file.path("data",
          "zms_thesis-macro_2017-06-18.csv") %>% 
  read.csv(stringsAsFactors = FALSE)

Preprocess taxa.df to only include unique instances of station IDs and sample dates. For more details about this process see the select and distinct sections.

dates.df <- taxa.df %>% 
  select(station_id, date) %>% 
  distinct()

DT::datatable(dates.df, options = list(columnDefs = list(list(className = 'dt-center', targets = 0:2))))

16.3.2 mdy, ymd, dmy, ymd_hms, …

In dates.df, the date column is imported as a character class and follows a “mm/dd/yyyy” format. The function mdy() can be used to convert the character strings in the date column to a date class.
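For comparison, the base-R equivalent requires spelling out the format string, whereas mdy() infers it from the "month, day, year" ordering:

```r
# Base R: the format string must be given explicitly.
as.Date("6/18/2017", format = "%m/%d/%Y")
# lubridate::mdy("6/18/2017") returns the same Date value.
```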

mdy.df <- dates.df %>% 
  mutate(date = mdy(date))

DT::datatable(mdy.df, options = list(columnDefs = list(list(className = 'dt-center', targets = 0:2))))

In the example above, it is obvious that the format of the date has changed, but it is not obvious that the R class has changed. First, look at the classes represented in dates.df.

sapply(dates.df, class)
##  station_id        date 
## "character" "character"

Then, looking at the column classes in mdy.df, we can see date has changed to class “Date”.

sapply(mdy.df, class)
##  station_id        date 
## "character"      "Date"

16.3.3 year, month, mday, yday, hour, minute, and second

Once a column is a date or datetime class, lubridate functions make it easy to extract parts of the date, such as year, month, day, hour, minute, and second. In the mutate() call below, I applied many, but not all, of the helpful functions for extracting datetime-related information. The majority of these are straightforward; however, we can change label and abbr to alter the output of functions like month() and wday().

  1. label
    • label = FALSE returns a numeric value
    • label = TRUE returns a character value
  2. abbr
    • If label = FALSE, then abbr has no effect
    • label = TRUE and abbr = TRUE returns an abbreviated character string
      • week: Sun, Mon, Tue, Wed, Thu, Fri, Sat
      • month: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
    • label = TRUE and abbr = FALSE returns an full character string
      • week: Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday
      • month: January, February, March, April, May, June, July, August, September, October, November, December

extract.df <- mdy.df %>% 
  mutate(year = year(date),
         month_int = month(date),
         month_abv = month(date, label = TRUE),
         month_full = month(date, label = TRUE, abbr = FALSE),
         week = week(date),
         day = day(date),
         wday_int = wday(date),
         wday_abv = wday(date, label = TRUE),
         wday_full = wday(date, label = TRUE, abbr = FALSE),
         mday = mday(date),
         qday = qday(date),
         yday = yday(date),
         hour = hour(date),
         minute = minute(date),
         second = second(date))

DT::datatable(extract.df, options = list(scrollX = TRUE))

16.3.4 round_date, floor_date, and ceiling_date

round_date() will round the date or datetime by the specified unit of time, such as “15 minutes”, “week”, “month”, or “year”. I find it really convenient that you can round to the nearest “15 minutes”. floor_date() and ceiling_date() provide similar functionality but always round down or up, respectively.

round.df <- mdy.df %>% 
  mutate(round_week = round_date(date, "week"),
         round_month = round_date(date, "month"),
         round_year = round_date(date, "year"),
         round_year5 = round_date(date, "5 years"),
         round_century = round_date(date, "100 years"),
         floor_month = floor_date(date, "month"),
         floor_year = floor_date(date, "year"),
         ceiling_month = ceiling_date(date, "month"),
         ceiling_year = ceiling_date(date, "year"))

DT::datatable(round.df, options = list(scrollX = TRUE,
                                       autoWidth = TRUE,
                                       columnDefs = list(list(width = '70px', targets = c(2)))))

16.4 ggplot2

library(ggplot2)

suppressPackageStartupMessages(
 library(dplyr) 
)

library(tidyr)

16.4.1 Example Data

Import the example data. This data represents benthic macroinvertebrate data collected in the littoral zone of Onondaga, Otisco, and Cazenovia lakes.

taxa.df <- file.path("data",
          "zms_thesis-macro_2017-06-18.csv") %>% 
  read.csv(stringsAsFactors = FALSE)

Preprocess taxa.df to just represent order-level taxonomic counts per sample. For more details about this process see the summarize section.

ord.df <- taxa.df %>% 
  select(unique_id, station_id, lake, order, count) %>% 
  group_by(unique_id, station_id, lake, order) %>% 
  summarize(count = sum(count)) %>% 
  ungroup() %>% 
  group_by(station_id, lake, order) %>% 
  summarize(count = mean(count)) %>% 
  ungroup()

DT::datatable(ord.df, options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))

Calculate the relative abundance (i.e., the percentage of the sample represented by each taxon) of each taxon within a sample. For more details about this process see the group_by section.

pct.df <- ord.df %>% 
  group_by(station_id) %>% 
  mutate(total = sum(count),
         percent = count / total * 100) %>% 
  ungroup() %>% 
  tidyr::complete(order,
                  nesting(station_id, lake, total),
                  fill = list(count = 0, percent = 0)) %>% 
  mutate(lake = factor(lake, levels = c("onon", "ot", "caz")))

Import environmental data associated with the taxonomic counts.

env.df <- file.path("data",
                     "zms_thesis-env_2017-06-18.csv") %>% 
  read.csv(stringsAsFactors = FALSE) %>% 
  mutate(station_id = stringr::str_sub(unique_id, 1, -3)) %>% 
  select(-unique_id) %>% 
  distinct()

DT::datatable(env.df, options = list(scrollX = TRUE))

16.4.2 ggplot

Generally, you want to start with a ggplot() call. Here we can see a blank figure template returned when ggplot() is called and pct.df is specified as data. The pipe operator can be used to pipe data into ggplot() but it cannot be used to chain subsequent ggplot2 functions together.

ggplot(pct.df)

pct.df %>% 
  ggplot()

16.4.3 aes

To start to format the figure, you will want to use aes(). When using aes() within ggplot(), you are specifying global variables. For instance, you will generally specify the columns, from the specified data frame (e.g., pct.df), that will represent your x-axis and y-axis. In the example below, the lake column represents the x-axis, while the percent column represents the y-axis. Now we have a template for our figure, but we will want to add layers (boxplot, scatter plot, bar plot) to this template to visualize our data.

pct.df %>% 
ggplot(aes(x = lake, y = percent))

Adding Layers (+)

ggplot2 does not recognize the pipe operator (%>%), instead it uses + to add components to a plot.

Now let’s add data to the plot by adding a + at the end of ggplot() and then specifying the plot type we are interested in creating. All plot functions in ggplot2 start with the prefix “geom_”.

16.4.4 geom_boxplot

Here we will create a box and whisker plot using geom_boxplot(). For this example, we will only focus on the percentage of Ephemeroptera (Mayflies) found within each lake by using the filter() function before the ggplot() function. The aesthetics have been specified within ggplot(); therefore, geom_boxplot() can be called without any arguments.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent)) +
  geom_boxplot()

16.4.4.1 Fill

The color of the interquartile boxes can be specified with fill. In this example, fill = lake; ggplot2 recognizes that lake represents three categories and automatically applies a default color palette. fill is specified within aes() inside ggplot(), and therefore this specification carries through to the subsequent geom_boxplot() call.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent, fill = lake)) +
  geom_boxplot()

16.4.4.2 Color

Color can be specified in a similar manner to fill. In the plot below, we can see the difference between color and fill (the previous plot).

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent, color = lake)) +
  geom_boxplot()

When I first started using ggplot2, I would frequently try to put the color or other aesthetic arguments outside of the aes() call. For example, in ggplot(aes(x = lake, y = percent), color = lake), color = lake is not within the aes() call. Therefore, it has no effect on the plot (i.e., there is no color added to the plot).

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent), color = lake) +
  geom_boxplot()

aes() and the color or other aesthetics can also be specified within the geom_boxplot() call. This allows you to be more specific about which feature you want to receive a given aesthetic.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent)) +
  geom_boxplot(aes(color = lake))

Additionally, if you want to specify specific colors not based on features from your data frame (e.g., lake), then you can specify these colors within the geom_boxplot() call. Note, that color is NOT specified within an aes() call.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent)) +
  geom_boxplot(color = c("purple", "orange", "brown"))

If you do attempt to wrap manually specified colors, like color = c("purple", "orange", "brown"), within aes(), you will get an error.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent)) +
  geom_boxplot(aes(color = c("purple", "orange", "brown")))
## Error: Aesthetics must be either length 1 or the same as the data (48): colour

Also, trying to manually specify colors within the ggplot() call will have no effect on the plot.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent), color = c("purple", "orange", "brown")) +
  geom_boxplot()

16.4.4.3 geom_point

Let’s add another layer to the plot above. geom_point() can be used to visualize where the data points actually fall relative to the box and whisker plots.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent)) +
  geom_boxplot() +
  geom_point()

16.4.4.4 geom_jitter

In the geom_point plot above, it is not possible to know if a point represents a single sample or if multiple samples with the same value are overlaid on one another. geom_jitter() can be used to randomly offset the points from their actual positions to avoid the overlap issue present in geom_point().

In this example, geom_jitter() is used to offset the points overlaid on the box and whisker plots. The points are colored by lake within this call: geom_jitter(aes(color = lake)). Additionally, the outliers plotted by geom_boxplot() are hidden with geom_boxplot(outlier.shape = NA). These outliers are still represented in the geom_jitter() call; therefore, if the outliers were not removed from geom_boxplot(), these points would be represented twice.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(color = lake))

Here we can see what happens if color is specified within the ggplot() call. Now the box and whisker lines and the points are colored by lake. This is in contrast to the last plot, where only the points were colored by lake.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent, color = lake)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter()

16.4.4.5 geom_violin

This plot helps us to see that there are a lot of samples from Onondaga Lake (“onon”) that have no Ephemeroptera (percent = 0).

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent, fill = lake)) +
  geom_violin()

16.4.4.6 geom_dotplot

Similar to violin plots, geom_dotplot() can be more informative than a box and whisker plot.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent, fill = lake)) +
  geom_dotplot(binaxis = "y",
               stackdir = "center",
               binwidth = 0.25)

geom_dotplot() could be overlaid on geom_boxplot(), which may be a better alternative to the geom_jitter example.

pct.df %>%
  filter(order == "ephemeroptera") %>% 
ggplot(aes(x = lake, y = percent)) +
  geom_boxplot(outlier.shape = NA) +
  geom_dotplot(aes(fill = lake),
               binaxis = "y",
               stackdir = "center",
               binwidth = 0.25)

16.4.4.7 geom_bar

For this section, the data will need to be summarized by lake. The data is aggregated by lake and order, and the median abundance (count) is calculated. The five most abundant taxa are retained, and the remaining taxa are lumped together and reclassified as “Other” (forcats::fct_lump(factor(order), n = 5, w = count)). The taxa are then sorted in descending order of abundance (forcats::fct_reorder(order, count, median, .desc = TRUE)).

abund.df <- pct.df %>% 
  group_by(lake, order) %>% 
  summarise(count = median(count)) %>% 
  ungroup() %>% 
  mutate(order = forcats::fct_lump(factor(order),
                                   n = 5,
                                   w = count),
         order = forcats::fct_reorder(order, count, median, .desc = TRUE),
         lake = factor(lake, c("onon", "ot", "caz")))

DT::datatable(abund.df, options = list(scrollX = TRUE))

Using geom_bar(), we can generate a bar plot. stat = "identity" must be used when a y value is specified in aes(); otherwise, ggplot2 will simply count the number of x value instances to create a y-axis count.

WARNING: This plot is a quick example of geom_bar() and is not a good representation of the data. In this case, the median abundance was calculated for each taxon within each lake. geom_bar() is summing the median values per taxon per lake to produce this figure.

abund.df %>% 
ggplot(aes(x = order, y = count)) +
  geom_bar(stat = "identity")

As the warning above notes, the above plot is not a great representation of the data because it essentially sums the three median abundance values per taxon per lake to produce the bars. We can see this breakdown by mapping fill = lake. This is also a poor way of viewing this data.

abund.df %>% 
ggplot(aes(x = order, y = count, fill = lake)) +
  geom_bar(stat = "identity")

position = "dodge" can be added to the geom_bar() call to make the bars for each lake appear side-by-side.

abund.df %>% 
ggplot(aes(x = order, y = count, fill = lake)) +
  geom_bar(stat = "identity",
           position = "dodge")

16.4.4.7.1 Stacked Bar Plots

Another great way to visualize this data is to create stacked bar plots. Stacked bar plots are almost always a better alternative to pie charts. In general, stacked bar charts provide a more straightforward representation of quantity, which makes it easier to compare multiple stacked bar charts than multiple pie charts.

  • Note that in this section the x-axis now represents lake, the y-axis represents count, and the fill represents order.

Stacked bar plots can be created by adding position = "stack" to the geom_bar() call. This information is exactly the same as the geom_bar() plot above, where position = "dodge" but it is more condensed.

abund.df %>% 
  mutate(order = forcats::fct_reorder(order, count, median)) %>% 
ggplot(aes(x = lake, y = count, fill = order)) +
  geom_bar(stat = "identity",
           position = "stack")

Stacked bar plots can also be created by adding position = "fill" to the geom_bar() call. This effectively calculates relative abundance, the percentage each taxon represents within a lake. This normalizes the data and makes it easier to compare between lakes.

abund.df %>% 
  mutate(order = forcats::fct_reorder(order, count, median)) %>% 
ggplot(aes(x = lake, y = count, fill = order)) +
  geom_bar(stat = "identity",
           position = "fill")

The chunks below reshape pct.df to a wide format with spread() so that the percentages of two taxa, diptera and amphipoda, can be plotted against one another with geom_point().

pct.df %>% 
  select(station_id, lake, percent, order) %>% 
  spread(order, percent) %>% 
ggplot(aes(x = diptera, y = amphipoda)) +
  geom_point()

pct.df %>% 
  select(station_id, lake, percent, order) %>% 
  spread(order, percent) %>% 
ggplot(aes(x = diptera, y = amphipoda)) +
  geom_point(aes(color = lake))

16.4.4.8 geom_line

geom_line() connects the observations with a line, here overlaid on the scatter plot from geom_point().

pct.df %>% 
  select(station_id, lake, percent, order) %>% 
  spread(order, percent) %>% 
ggplot(aes(x = diptera, y = amphipoda)) +
  geom_point(aes(color = lake)) +
  geom_line()

16.4.4.9 geom_smooth

geom_smooth() overlays a fitted trend line on the data: method = "lm" fits a linear model, while method = "loess" fits a locally weighted regression. When color is mapped globally within ggplot(), a separate trend line is fit for each lake.

pct.df %>% 
  select(station_id, lake, percent, order) %>% 
  spread(order, percent) %>% 
ggplot(aes(x = diptera, y = amphipoda)) +
  geom_point(aes(color = lake)) +
  geom_smooth(method = "lm", formula = y ~ x)

pct.df %>% 
  select(station_id, lake, percent, order) %>% 
  spread(order, percent) %>% 
ggplot(aes(x = diptera, y = amphipoda, color = lake)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

pct.df %>% 
  select(station_id, lake, percent, order) %>% 
  spread(order, percent) %>% 
ggplot(aes(x = diptera, y = amphipoda)) +
  geom_point(aes(color = lake)) +
  geom_smooth(method = "loess", formula = y ~ x)

pct.df %>% 
  select(station_id, lake, percent, order) %>% 
  spread(order, percent) %>% 
ggplot(aes(x = diptera, y = amphipoda, color = lake, fill = lake)) +
  geom_point(aes(color = lake)) +
  geom_smooth(method = "loess", formula = y ~ x)

16.4.4.10 facet_wrap

facet_wrap() splits a single plot into one panel per level of the faceting variable (here order); scales = "free" allows each panel to use its own axis ranges.

pct.df %>% 
  select(station_id, lake, percent, order) %>% 
  spread(order, percent) %>% 
  gather(order, count, amphipoda:veneroida, -diptera) %>% 
ggplot(aes(x = diptera, y = count, color = lake, fill = lake)) +
  geom_point(aes(color = lake)) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) + 
  facet_wrap(~order,
             ncol = 1,
             scales = "free")

17 Shiny

17.1 What is Shiny?

Shiny is an R package that enables the developer to create interactive web applications (apps).

My Example: https://zsmith.shinyapps.io/WQT_Shiny/

17.2 Resources

R Studio has a website dedicated to shiny (https://shiny.rstudio.com/). There are a lot of resources available there, but for those just learning shiny, or looking for a shiny refresher, I would direct you to the tutorial page (https://shiny.rstudio.com/tutorial/).

17.3 Project Composition

Shiny apps are mainly composed of three files: 1) ui.R, 2) server.R, and 3) global.R.

17.3.1 ui.R

In this file, you will specify the aesthetic aspects of the app. This includes: the presence/absence of a navigation bar, the presence and position of a dropdown menu, the presence and position of a slider bar, and the location of figures created in the server.R file.

17.3.2 server.R

In this file, you will specify reactive functions that respond to user inputs. For example, an app may contain a dropdown menu of sample sites and a scatterplot representing the site selected in the dropdown menu. Each time the user selects a new sample site from the dropdown menu, the scatterplot updates.

17.3.3 global.R

In this file, you should include code that is static. This includes: loading libraries, sourcing functions, and potentially importing data. These activities are intended to occur once, when the app starts, and do not react to user inputs.
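A typical global.R might look like the sketch below; the file paths are hypothetical placeholders:

```r
# global.R -- minimal sketch; runs once at startup, before ui.R and server.R.
library(shiny)
library(dplyr)
library(ggplot2)

# Source helper functions once.
source("functions/helpers.R")

# Import data once; every user session shares this object.
site_data <- read.csv("data/sites.csv", stringsAsFactors = FALSE)
```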

17.3.4 Structure

Shiny projects generally grow rapidly, and it can become difficult to navigate hundreds of lines of code in a ui.R or server.R file. My preference is to break the code up into independent, more manageable scripts that are sourced in the ui.R or server.R files. For example, imagine you are developing an app which contains two tabs, one dedicated to tabular data and one dedicated to an interactive map. I would develop separate R scripts for the server code associated with each tab. Similarly, I would create separate R scripts for the ui code associated with each tab. These files are stored in the appropriate folders, labeled either “ui” or “server.” When sourcing files in a shiny app, you must specify local = TRUE:

source("server/select_import_server.R", local = TRUE)

17.3.5 R-Packages

Most Shiny apps will require multiple R-packages. I recommend loading all of the necessary R-packages in the global.R file. This makes it simple to identify all the packages you must have installed locally to edit and develop a given shiny app. One way to simplify this task is to use the example provided at the following link: https://www.r-bloggers.com/install-and-load-missing-specifiedneeded-packages-on-the-fly/. Following this script’s template:

  1. You specify all of the necessary R-packages.
  2. The code checks that all these packages are installed.
  3. If any packages are not installed, the code will install these packages.

This makes it easier to collaborate with others or work on multiple computers.
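The three steps above can be sketched as follows (the package list is a hypothetical example):

```r
# Install-if-missing pattern, following the template linked above.
# 1. Specify all of the necessary R-packages.
pkgs <- c("shiny", "dplyr", "ggplot2", "leaflet")

# 2. Check which of these packages are not installed locally.
missing <- pkgs[!pkgs %in% installed.packages()[, "Package"]]

# 3. Install any packages that are missing.
if (length(missing) > 0) install.packages(missing)

# Finally, load all of the packages.
invisible(lapply(pkgs, library, character.only = TRUE))
```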

17.4 Helpful R-Packages

17.4.1 DT (Interactive Tables)

*Link: https://rstudio.github.io/DT/

The R-package, DT, is a great resource for creating interactive tables.
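As a minimal sketch, an interactive table takes one line in each of ui.R and server.R (the output ID "my_table" is a hypothetical placeholder, and mtcars stands in for your data):

```r
library(shiny)
library(DT)

# In ui.R: reserve space for the table.
DT::dataTableOutput("my_table")

# In server.R: render the interactive table.
output$my_table <- DT::renderDataTable({
  DT::datatable(mtcars, options = list(pageLength = 10))
})
```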

17.4.2 dygraphs (Interactive Time Series Plots)

*Link: https://rstudio.github.io/dygraphs/

The R-package, dygraphs, is a great resource for creating interactive time series plots.
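A minimal sketch, using the built-in ldeaths monthly time series in place of your own data:

```r
library(dygraphs)

# ldeaths is a built-in monthly time series (datasets package).
# dyRangeSelector() adds an interactive zoom/pan control below the plot.
dygraph(ldeaths, main = "Monthly Deaths from Lung Diseases (UK)") %>%
  dyRangeSelector()
```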

17.4.3 leaflet (Interactive Maps)

*Link: http://rstudio.github.io/leaflet/

The R-package, leaflet, is a great resource for creating interactive maps. When using this package in shiny, there are a few steps you need to take to make the map function well (http://rstudio.github.io/leaflet/shiny.html). It is generally useful to render the base map once as a shiny output and then use a leafletProxy within reactive functions to update the points presented on the map. Using leafletProxy, only the map points update; the base map remains unchanged. This prevents the entire map from reloading each time the points are updated, which would otherwise make the map appear to flash.
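A sketch of the base-map-plus-proxy pattern; the output ID "map", the reactive `sites()`, and its longitude/latitude columns are hypothetical placeholders:

```r
library(shiny)
library(leaflet)

# In server.R: render the base map once.
output$map <- renderLeaflet({
  leaflet() %>%
    addTiles() %>%
    setView(lng = -77, lat = 39, zoom = 7)
})

# Update only the markers when the reactive data changes;
# the base map itself is never redrawn.
observe({
  leafletProxy("map", data = sites()) %>%
    clearMarkers() %>%
    addCircleMarkers(lng = ~longitude, lat = ~latitude)
})
```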

17.4.4 plotly (Interactive Figures)

*Link: https://plot.ly/r/

The R-package, plotly, is a great resource for creating interactive figures.
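One convenient entry point is ggplotly(), which converts an existing ggplot2 figure into an interactive one. A minimal sketch, using mtcars in place of your own data:

```r
library(plotly)  # attaches ggplot2 as a dependency

# Build a static ggplot2 figure, then make it interactive.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

ggplotly(p)
```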

17.5 Publishing

*Link: http://www.shinyapps.io/

shinyapps.io is a shiny hosting platform provided by R Studio. Users must create either a free or a paid shinyapps.io account. The free account limits the number of applications you can publish and the number of hours your apps can be active per month. There are multiple tiers of paid accounts; as the user pays more, the user can publish a greater number of applications, receives more active hours per month, and gains other additional benefits from R Studio.

17.5.1 How to Publish to shinyapps.io

  1. Click on the “Publish to Server” button located in the top right corner of the source pane.

  2. A drop-down menu will appear. Select the appropriate shinyapps.io account.
    • If this is your first time connecting to a shinyapps.io account:
      • Make sure you have created a shinyapps.io account (https://www.shinyapps.io/admin/#/login).
      • In the drop-down menu, select “Manage Accounts.” This will bring you to the “Publishing Accounts” section of R Studio’s “Options.”
      • Select “Connect…” -> “ShinyApps.io” -> follow the instructions.
  3. The “Publish to Server” window will appear. Select all of the files that are necessary for the app to run. Do not include unnecessary files, as this could slow down your app or make the app too large to host under your current account settings. Check the “Launch browser” check box. Click “Publish.”
  4. A “Deploy” tab will appear in the console pane, which will inform you that R Studio is working on publishing your app. This will take a few minutes. If there are any issues, the app will stop deploying and you will receive an error message.